Anatomy of a product quality issue: PoE HAT
One of the neat new features of the Raspberry Pi 3 Model B+ is its support for IEEE 802.3af Power-over-Ethernet (PoE). This standard allows up to 13W of power to be delivered over the twisted pairs in an Ethernet cable without interfering with the transmission of data. The Raspberry Pi board itself provides a PoE-capable Ethernet jack and circuit protection components; the power regulation electronics, which would be too costly and bulky to include on the main board, live on a separate HAT.
When we announced the 3B+, we revealed that an official Raspberry Pi PoE HAT was in the works and, after a few unforeseen production delays, we released this HAT at the end of August. Feedback was, and remains, generally very positive; but fairly quickly, we started to see some reports from users who were experiencing issues.
The problem they reported was this: when powering certain Raspberry Pi units via the PoE HAT, it was not possible to draw the full rated current from the USB ports.
Our 5V USB output, denoted VBUS, is fed by the main 5V rail via a current-limiting switch. This switch is designed to protect the system by detecting short-circuit, over-current, or reverse-voltage events, and disconnecting the USB ports in response. Our current-limiting switch is set to a limit of just over 1A.
Despite the PoE HAT’s ability to supply up to 2.5A, the experiments we ran in response to the reports suggested that, when it was used to supply some boards, the USB supply would trip out at a much lower current. Mice and keyboards worked fine, but higher-current devices such as wireless dongles and hard disks would fail.
Our initial theory was that the PoE HAT was injecting noise into the Pi via the 5V rail, and that this was somehow upsetting the switch. However, we were able to rule this out, since we found no evidence of high-frequency noise at the input to the switch. Another theory was that the flyback transformer’s close physical proximity to the switch was somehow coupling noise in. But we were able to rule this out as well: we showed that the behaviour persisted when the HAT was connected using a right-angle header, which moves the power electronics away from the Raspberry Pi.
What was happening?
The PoE HAT works by converting the incoming 48V from the Ethernet lines to 5V using a flyback transformer. In simple terms, the primary side of the transformer is switched across the 48V, and energy is stored in the transformer in the form of a magnetic field. The primary is then disconnected and the magnetic field collapses. This changing magnetic field induces a voltage (scaled based on the transformer turns ratio) in the secondary, which is rectified by a Schottky diode and smoothed by the output capacitance. This output capacitance is formed from the output capacitors on the PoE HAT itself, the capacitors on the Raspberry Pi 5V rail, and, when the switch is on, the VBUS reservoir capacitors.
The switching frequency of the flyback transformer is relatively low (~100 kHz). This means that when the system is under load, each switching cycle must transfer a relatively large amount of energy. During each cycle, the 5V rail is discharged according to the load on the system, and charged up again by the flyback’s secondary, dumping more energy into the caps. In each cycle, a spike of high current is pushed through the output diode into the capacitors.
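To put a rough number on "a relatively large amount of energy", here is a back-of-envelope sketch in Python. Only the ~100 kHz switching frequency, the 5V rail and the 2.5A rating come from the text above; the rail capacitance is a made-up illustrative value, and the droop figure is an upper bound that assumes the capacitors carry the whole load for a full period.

```python
# Back-of-envelope estimate, not a model of the real HAT.
# From the article: ~100 kHz switching, 5 V rail, 2.5 A maximum output.
# Assumed (hypothetical): C_OUT, the total capacitance on the 5 V rail.

def energy_per_cycle(load_power_w: float, switching_hz: float) -> float:
    """Energy in joules that each switching cycle must deliver to sustain the load."""
    return load_power_w / switching_hz


def rail_droop(load_current_a: float, switching_hz: float, capacitance_f: float) -> float:
    """Upper-bound droop in volts if the capacitors alone feed the load for one period."""
    return load_current_a / (switching_hz * capacitance_f)


V_OUT = 5.0      # volts
I_LOAD = 2.5     # amps
F_SW = 100e3     # hertz
C_OUT = 500e-6   # farads, hypothetical value for illustration only

e = energy_per_cycle(V_OUT * I_LOAD, F_SW)
dv = rail_droop(I_LOAD, F_SW, C_OUT)
print(f"energy per cycle: {e * 1e6:.0f} uJ")
print(f"upper-bound droop per cycle: {dv * 1e3:.0f} mV")
```

At full load this works out to about 125 µJ per cycle. Doubling the switching frequency would halve that figure, which is why a low-frequency converter produces larger current pulses for the same average load.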
To cut a long story short, putting a current probe on the input to the switch showed large current spikes, as energy from the flyback made its way into the VBUS reservoir capacitors. This was expected. However, it turned out that the switch was erroneously registering these spikes as true over-current events. The switch is supposed to have a filter that allows it to ignore brief spikes, but we discovered that only one of the two approved versions of the switch did this correctly.
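The difference between the two switch versions can be illustrated with a toy model in Python. This is a sketch of the general idea of a deglitch filter, not the actual behaviour of either part: the spike height, spike width, baseline current and deglitch time are all hypothetical, and only the limit of just over 1A and the ~100 kHz repetition rate are taken from the text.

```python
# Toy model of a current-limit switch with and without a deglitch filter.
# Hypothetical values: spike shape, baseline current, deglitch length.
# From the article: limit of "just over 1A", ~100 kHz switching.

def spiky_current(n_cycles, samples_per_cycle, spike_samples, spike_a, base_a):
    """Build a periodic waveform: a short spike at the start of each cycle, then a baseline."""
    wave = []
    for _ in range(n_cycles):
        wave += [spike_a] * spike_samples
        wave += [base_a] * (samples_per_cycle - spike_samples)
    return wave


def trips(samples, limit_a, deglitch_samples):
    """True if the current stays above limit_a for more than deglitch_samples in a row."""
    run = 0
    for current in samples:
        run = run + 1 if current > limit_a else 0
        if run > deglitch_samples:
            return True
    return False


LIMIT_A = 1.1  # amps, standing in for "just over 1A"

# 0.1 us per sample: 100 samples per 10 us cycle, 1 us spikes of 3 A over a 0.5 A baseline.
wave = spiky_current(n_cycles=20, samples_per_cycle=100, spike_samples=10,
                     spike_a=3.0, base_a=0.5)

print("average current (A):", sum(wave) / len(wave))
print("trips with no deglitch filter:", trips(wave, LIMIT_A, deglitch_samples=0))
print("trips with a 5 us deglitch filter:", trips(wave, LIMIT_A, deglitch_samples=50))
print("sustained 2 A overload still trips:", trips([2.0] * 100, LIMIT_A, deglitch_samples=50))
```

In this model the average current stays below the limit, yet the unfiltered detector trips on the first spike. The filtered detector ignores the brief spikes but still catches a sustained overload, which is the behaviour the switch was supposed to have.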
If it’s not been tested, it’s broken
It’s a truism that if you don’t test an aspect of a design, it will certainly be broken. Those of us with a Broadcom background sometimes refer to this as Alan Morgan’s rule, after its most enthusiastic proponent.
Extensive testing over all configurations, operating parameters, and use cases is the only way to minimise the likelihood of releasing a product with a hardware issue. Even relatively simple hardware can end up catching you out by throwing up some unexpected bug or issue. And even the big guys with huge development teams and test labs occasionally mess things up — anyone remember the Pentium FDIV bug?
We made several mistakes with the first version of the PoE HAT:
- USB load testing was performed using boards that had the working switch
- Our field testing programme was abbreviated because the product was late
- We didn’t inquire as to whether our field testers were using high-current peripherals (they weren’t)
It’s embarrassing to have released a product with a bug like this, but it’s a lesson well-learned, and we will be improving our internal processes to prevent a recurrence.
Fortunately, this bug turned out to be easy to fix. We designed an L-C filter to apply further smoothing to the output current from the HAT. The filter consists of a little extra input and output capacitance and a 4.7µH inductor (chosen to have a suitable current rating and DC resistance), as well as a 330mΩ resistor in parallel to provide damping. We were even able to wrap the mod up in a little mezzanine PCB that fits neatly underneath the board.
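For a feel of what such a filter does at the switching frequency, here is a minimal Python sketch. The 4.7µH and 330mΩ values come from the text; the capacitance is not stated, so the 47µF used here is purely an assumption, as is the topology (resistor in parallel with the inductor, followed by a shunt capacitor). The model is idealised: no load, no capacitor ESR, no inductor DC resistance.

```python
import math

L_FILTER = 4.7e-6   # henries, from the article
R_DAMP = 0.33       # ohms, from the article; assumed to be in parallel with the inductor
C_FILTER = 47e-6    # farads, hypothetical: the article only says "a little extra" capacitance


def corner_frequency(l, c):
    """Undamped L-C corner frequency in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l * c))


def gain(freq_hz, l=L_FILTER, r=R_DAMP, c=C_FILTER):
    """|Vout/Vin| for a series (L parallel R) branch feeding a shunt C, unloaded. freq_hz must be > 0."""
    w = 2.0 * math.pi * freq_hz
    z_l = 1j * w * l
    z_series = (z_l * r) / (z_l + r)   # inductor in parallel with the damping resistor
    z_c = 1.0 / (1j * w * c)
    return abs(z_c / (z_series + z_c))


f_c = corner_frequency(L_FILTER, C_FILTER)
print(f"corner frequency: {f_c / 1e3:.1f} kHz")
for f in (1e3, f_c, 100e3):
    print(f"gain at {f / 1e3:7.1f} kHz: {20 * math.log10(gain(f)):6.1f} dB")
```

With these assumed values the corner sits well below 100 kHz, so ripple at the switching frequency is attenuated to roughly a tenth of its amplitude. The parallel resistor keeps the resonance near the corner frequency from peaking, at the cost of limiting how steeply the filter rolls off above it.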
Once we had confirmed that there was a problem with the PoE HAT, we took the product off sale, and recalled and reworked the outstanding units. We are now happy to announce that most Approved Resellers should now have the revised boards in stock. We believe that most people who have been affected by this issue have already returned their PoE HATs for a refund; if you’re experiencing issues and haven’t yet returned your product, you can get in touch with your reseller to arrange a replacement.
I’d like to thank the members of the Raspberry Pi engineering team, our contract manufacturing partners Taijie, our licensee partners and Approved Resellers, and also the community members who kindly tested prototypes of the fixed board design. This hasn’t been the easiest product launch in our history, but hopefully the lessons learned have set us up well for the future.
Hi, thanks for the explanation. I was wondering where the PoE HAT had gone from the Farnell store.
I designed some PoE-enabled 100 Mbit/s FSO units some time ago and had some issues with voltage ramp-up and inrush current. PoE limits capacitance on powered devices to limit inrush current in the first place, so the higher the switching frequency the better – less ripple. But in our case the switching frequency was interfering with the input LC filter. It was not the best design. I understand your lessons learned very well :-)
My single question therefore is, how well does it ramp up voltage (presumably on 5V rail) when loaded (via USB devices for example) to the max and then being plugged in?
This is a good write-up of an interesting bug. Stuff happens, and it’s always interesting to see how it’s handled – one of the best ways to understand something is to break it and work out what’s going on. A lot has gone right here.
However, the communications strategy of not posting this to the blog until now, and not updating the original blog post (which, as of this morning, was the most current high-profile communication on the issue and was still claiming the Hat was great and ‘available now’ when it wasn’t), was a really poor choice.
It didn’t work well from a commercial perspective, leaving customers in the dark, and it didn’t work well from an educational perspective either. There’s clearly always a temptation to make less noise about a mistake than about a success, but the way this worked out with information being split between a community forum thread, a Register article, and fragmented Twitter conversations was woeful.
James has been doing a sterling job in the forum, but whoever thought it was a good idea to try to keep this on the down-low should think again.
Having it prominently featured on their front page is “Keeping it on the down-low”? In what universe is that the case? I would never have heard about it, if it wasn’t on the front page. PoE is an interesting feature to have for the Raspberry Pi, but likely won’t be used by 90%+ of their user base. I find it rather interesting that it’s supported, but have a hard time finding myself making use of it. It’s unfortunate they’ve had troubles with this board. Though, it’s quite good that they posted prominently on their front page the troubles they have been facing regarding this feature.
I’m baffled too. Here I am “keeping it on the down-low” in The Register when the problem first surfaced:
You can’t please some people.
Some people just love a conspiracy theory. I thought you were all honest and straightforward about what happened here, and it’s been interesting reading about how a company fixes something like this. It was really refreshing. Not many companies do this.
The Foundation has known about the problem for weeks.
You could have updated the original blog post, you didn’t.
You could have posted a new one, you didn’t.
You could have tweeted it, you didn’t.
You could have responded to questions on Twitter, you ignored them.
It doesn’t require a conspiracy theory to think that’s pretty poor comms. As I said, this post is good. But there should have been one weeks ago. You know how you get people’s attention on things if you want it, and the Foundation has not done those things up until now.
The first comment on this post is from someone wondering why the Hat had disappeared. They wouldn’t have been wondering if it had been communicated clearly.
We could have done any of these things.
Instead, we talked to the press, engaged with our customers on the forums (including you, despite the continuous high-pitched whining – cheap ceramic caps maybe?), took the product off sale for a while and fixed the problem.
“We could have done any of these things.”
And you should have. You know what you do when you want to get people to notice something, and you didn’t do it. Don’t pretend that the forum threads are equivalent, they’re not.
Maybe we’re starting from different places – I’d have thought you’d want to keep people in the loop as effectively as possible. But maybe you don’t. Maybe you’re happy that people were in the dark. But I’ve no idea why you would be.
dude… your asking a whole lot.. its an org man. dont be a douce. they refunded and fixed didnt they? yeah man go home nobody wants you in this thread.
Note: “until now”
Thank you James for this detailed and illuminating description of your experiences and the Foundation’s honesty in declaring something was wrong (and what, and why).
This is a small hardware bug on a peripheral and only goes to show how much care and attention has gone into all the /other/ Raspberry Pi products from day one.
Congratulations to all involved.
Any chance of getting the mezzanine board so we can rework them ourselves? I bought three of the original boards, but only have one unopened, and it’s not really worth shipping it back, but I’d happily do the rework myself…
Can you mail me (firstname.lastname@example.org), and I’ll see what we can do. It’s a fairly simple mod, just not sure if we have any spare mezzanine boards kicking around.
I think the best part of this issue is the article announcing it. There was a bug in the product, you took steps to resolve the issue, and now that a fix is available, you are being open about the issue rather than trying to deny or sweep it under the rug. More companies need to take this approach.
Only thing I would change is to include some way of telling the two boards apart for those reading the article who might still have a bad switch.
This is exactly my problem…
I have the PoE HAT acquired at Pimoroni.
I used it just a little bit with mouse and keyboard.
I don’t know how to test if my HAT is OK or not.
I almost acquired a USB load tester, but there must be another way to tell.
I knew about this problem weeks ago when Eben gave the interview and Raspberry Pi held their hands up and admitted their bad.
In my opinion this was superbly handled and a lot of other businesses can learn from this approach. This has increased my goodwill towards Raspberry Pi no end. And I already held them in high esteem. I’m confident to buy their stuff knowing they stand by it.
I’ll probably never need a POE hat, but if I do, I won’t have any worries about buying one.
It’s extremely refreshing to see a problem discussed in the open like this, as opposed to the old “sweep it under the carpet” routine. Extremely well done, and thanks.
I would like to know if its possible to buy just the RPI-POE-MOD daughter-board, I can cut the trace and install this myself.
Can you mail me (email@example.com), and I’ll see what we can do. It’s a fairly simple mod, just not sure if we have any spare mezzanine boards kicking around.
Thank you for the quick fix release.
Whether or not the response to the issue was handled correctly is subjective.
I am impressed that it only took 3 months to rectify and that we have a full explanation.
Good to know that RPF / RPT are humble enough to admit their mistakes.
I am loving this breakdown of the problem, resolution and the foundation being so up front about it. Super refreshing being able to trust a company/non-profit. Keep up the great work!
POE ALL THE THINGS
Any chance there will ever be an 802.3at HAT for the full 25W of power?
A 802.3bt HAT would be even more amazing but I know that’s still very niche right now.
We’re considering it, though the current available from 802.3af is almost exactly equal to our peak theoretical consumption at the point where the switch (legitimately) trips.
OK, I am doing something wrong then (LOL).
I have a POE hat running from a HP Procurve 2915 switch.
On my Pi, I have the poe board and then on top of that (love how the pins can pass through for additional hats) an additional audio hat. I then have a flightaware USB stick capturing flight data.
Granted I did unplug the fan as it was noisy and I already had a heatsink on that Pi.
I’m glad it’s working for you. We’ve seen a lot of people using the HAT successfully: we didn’t enjoy taking the product off sale because we knew that for every one person we were saving from the issue we were probably inconveniencing ten people who wouldn’t have experienced it.
It sounds like either your receiver stick has fairly low power consumption, or you have a Raspberry Pi with one of the switches that doesn’t exhibit the problem.
You also have to say it is not your first faux pas, the official Raspberry Pi case covers up the Pi hat pins, leaving you to dremel the case to make the hat pins usable. Then the lid is loose.
Raspberry Pi Staff Simon Long
The case was never designed to take a Pi with a HAT attached, and we’ve never claimed otherwise. (Some do fit, but that’s coincidence, not by design.) So hardly a “faux pas”.
An excellent article that will benefit both the layman and the engineer or, even better, the student engineer. Thank you for your honesty, which, as a person who is involved in the daily support of large amounts of electronic equipment, is very refreshing. Just recently I have observed an issue in a laser projector from one particularly large consumer electronics company which had a VERY intermittent issue and one serious enough to cause quite a lot of inconvenience in the forum in which it was deployed. After extensive testing we’ve now confirmed that a number of units that are just a few serial numbers apart have exactly the same issue. We will likely never know what happened when a fix is applied as it is likely that they will not give any information, even if directly queried. I have seen multiple similar scenarios over a number of years which begs the question as to how many devices out there have inherent faults that were not detected during the product testing phase? Again, great write up and the honesty is much appreciated. Also @Eben – cheap ceramic capacitors indeed lol
I find the way you have addressed this issue, being very open about the issue, the reasons, the mea culpa, the publishing of the analysis, the follow up actions….. EXEMPLARY!
A shining example for many commercial software and hardware vendors!
Do the bulk decoupling capacitors on the Pi 3B+ have sufficiently low ESR for operation at 100 kHz?
OK, now I want one :-) With so much networking going on, I see so many ways to use the Pi powered over the network with just a little Ethernet run.
Well done. And as to communicating the issue and solution, I like more information over less but really just prefer the end result and “we ran down many different possibilities and here are a few. Our fault, we know how to test, here’s the fix” seems pretty solid and transparent to me.
This is an excellent write-up of a fault-finding investigation, and having worked in contract manufacturing electronics, it’s the type of problem that is not easy to isolate and resolve.
I saw ‘made in China’
Yes you would, it’s silk screened on the board. Because that is where it is made.
However i can not run even low powered USB-devices(i.e. keyboard), i can not even run unpowered ones, with the PoE hat. Reading the forum thread about this problem confirms that other people cant run it with low powered devices either.
There is still a lot of ripple/spikes in the supplied voltage after using this fix. But it might be good enough?
Anyway, thanks for the blog post.
So you have a modified board, but it will not work with low-power devices, or are you still using the original board? In which case this low-power issue was known about.
Sorry, it was the wrong application note, the correct document was SLVU136, see page 5.
I use the PoE HAT as well. I noticed that, with the PoE HAT, there is the typical high-frequency coil noise, at least when powering up.
In my experience, good electronics designs don’t display such behaviour and have a much longer lifetime. I am always surprised that even some very professional companies don’t do a short test of their electronics with a high-sensitivity microphone (or, say, a teenage musician’s ear). The absence of coil noise improves confidence and the user experience, e.g. in the case of an NTP clock in your bedroom.
I have had a new PoE HAT delivered from my supplier, please can you confirm what I need to look for on the board or packaging to confirm it is the updated spin?
See the excellent pictures at the end of the article. The last one shows the new revised board with attached mezzanine
Uh oh. I just bought 2 PoE HATs from Dustinhome.fi, a Swedish company. Unfortunately, both the boards do not have the mezzanine visible in your example photos. They are like the original here: https://www.raspberrypi.org/app/uploads/2018/11/PoEHAT1CU.jpg
Should I return the devices to reseller and claim product return / fault etc? Mind, they claim they still have a large stock available, so others may face the same issue.
If your combination of Raspberry Pi, PoE HAT and workload exhibit the issue, you should certainly return it. I’ll get in touch with Dustinhome and see why they haven’t returned their stock for remanufacture.
I’ve been following the original PoE HAT development and subsequent fix, and finally purchased one last week (and confirmed it matches the final board with the mezzanine). I successfully powered up my unit one time via the PoE HAT to test and confirm all was well, but subsequent boot ups are unsuccessful with only a solid red light. Removing the PoE HAT and going back to the normal USB power supply still results in a solid red light and no boot up.
I have over 6 Pi’s in use with plenty of units to test with, and I’ve since found that any Pi (3 B+, of course) I plug my PoE HAT in to and attempt to boot up with will be fried, and never boot up again (just a solid red light) –> This also occurs if using the USB power supply and not the PoE HAT for power while it is installed. The act of having my PoE HAT installed and attempting to power my Pis up render them permanently bricked.
I now have 3 Raspberry Pis that will not boot up, with or without the PoE HAT, and I will attempt to return it to the store unless the developers here would like to capture it (and any of my non-bootable Pis) for failure analysis.
Thank you for this informative report; I’ll draw it to the attention of our engineers and we’ll let you know if they want to take a look at those units to figure out what’s going on here.
Helen – many thanks for reaching out, but I’d like to offer an apology and a correction to my previous post that I discovered while disassembling my units. It appears that while moving my Pi to the official Raspberry Pi case to accommodate the PoE HAT, I bent the micro SD card which introduced a crack.
So, it appears that my initial boot ups were successful, because they were all using their ‘original’ SD cards, but would later die due to the introduction of this cracked SD card (and the PoE HAT). My Raspberry Pi 2’s (without the PoE HAT) would of course not boot up with this SD card, but still continue to work, so my combination of cracked SD card and PoE HAT appears to be the issue.
Many apologies to the group and this thread for not noticing this detail earlier!
Have PoE HAT problem. it continually disconnects from Ethernet or won’t even show up on the Ethernet connection but will power the Pi and work over Wifi.
It’s good to see the problems get fixed in the open, some big companies should be taking notes right now :)
That is the legit approach, good job!