Tuesday, May 7, 2019

DockerCon Hallway Track - Logging to a NFS mounted Docker Volume

Recently at DockerCon I decided to check out the whole 'Hallway Track' concept and share a sticky situation we ran into at work. If you've never been, the Hallway Track is a coordinated effort to get conference attendees talking to one another and sharing knowledge. You post your topic or peruse the posted topics and they facilitate connecting people together to discuss face to face. I was amazed at the number of people who requested to talk to me about our experiences with the complications of writing application logs to a NFS mounted docker volume. The official title was "Lessons learned: Why writing to flat file logs in an NFS mounted docker volume is unwise".

So, I'll preface the rest of this by saying that an NFS mounted docker volume that is shared by multiple containers of a single service works VERY WELL for 99% of use cases.

When we converted our applications to run in Docker, we did so with as little refactoring as possible. It all worked great locally and in staging with a small swarm using a shared NFS volume where the application files were stored. This seemed to work in production as well, until we redirected the final (and busiest) site to the new host dns.  It was then that we started to experience the containers starting to cascade failures (cue total panic).  Digging through the service logs in Stackdriver wasn't giving us much, and we could see database latency spiking, so at first we blamed the new remote database and decided we needed to reduce the load to that and switch everything to local logs instead of database logs. This had the opposite of the desired effect - as now our containers were going unhealthy even faster.

Enter Blackfire.io to save the day. We finally managed to capture what was happening with a trace on the production server that pointed to the logging as being the culprit of the failures, when we caught a page dying on fsock open. Apparently, it's incredibly unwise to have 9 php services each running 1k+ worker threads all trying to write to the same warning.log file (complete facepalm). Very obvious once you think of it in those terms. The real kicker was that the items being logged were warnings, not critical errors. We've since taken a good look at what really needs to be logged and what things add no value, just overhead.

Moral of the story? If your NFS docker volume may need to be written to by multiple docker containers (which any service at scale in production is sure to need to do), ensure that whatever you are writing utilizes unique file names/paths per container. NEVER write to the same file from multiple containers simultaneously unless you want write lock conflicts.

In our case, the simplest solution was to switch our Yii log targets from \yii\log\FileTarget to \yii\log\DbTarget (the exact opposite of our knee jerk solution).  As soon as we resolved that issue, everything stabilized and has been smooth sailing.  (Other than the fact that the database logs don't auto rotate - currently we're examining options for sending application logs out to Stackdriver or some other log analyzer so that we can consolidate them with the infrastructure logs)

No comments:

Post a Comment