Fluentd not tailing pod logs (Solved)

I have an AKS cluster with a number of pods, and I want to aggregate the logs from all of the pods to a third-party service (in my case, Papertrail) using Fluentd. So I signed up with Papertrail and followed their guide to integrate a Fluentd daemonset into the cluster and start ingesting logs into my Papertrail account.

But I had a couple of problems:

  • None of the pod logs were appearing in Papertrail :(
  • The Fluentd pods would crash and restart after a few minutes

This post covers the diagnostic steps I took and how I solved the problem.

AKS cluster

My AKS cluster is running Kubernetes 1.20.2, which uses containerd as the container runtime.
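
If you want to confirm which runtime your nodes are using, kubectl get nodes -o wide includes a CONTAINER-RUNTIME column (the output below is trimmed and the containerd version shown is just illustrative):

C:\Users\ryan>kubectl get nodes -o wide
NAME                                STATUS   ...   CONTAINER-RUNTIME
aks-agentpool-31130745-vmss000000   Ready    ...   containerd://1.4.3+azure
aks-agentpool-31130745-vmss000001   Ready    ...   containerd://1.4.3+azure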

Using Papertrail’s Fluentd image

On Papertrail’s documentation page, under the Log from all containers and systems section, they provide a link to a Docker image which is intended to get you up and running with minimal effort. They also provide a link to a Daemonset manifest file.

Following their steps and configuring my Papertrail host and port details, I was able to see some logs appear in Papertrail:

Apr 17 10:11:43 mycluster-aks-agentpool-31130745-vmss000001 kubelet I0417 02:11:42.563823    4782 logs.go:317] Finish parsing log file "/var/log/pods/kube-system_kube-proxy-pqpxg_829a536d-24bd-4603-91b8-a0d4a9492397/kube-proxy/0.log"
Apr 17 10:15:33 mycluster-aks-agentpool-31130745-vmss000000 kubelet I0417 02:15:32.606277    4607 logs.go:317] Finish parsing log file "/var/log/pods/kube-system_kube-proxy-rcph5_33226e53-894c-4153-8f05-2684d7af9043/kube-proxy/0.log"
Apr 17 10:15:34 mycluster-aks-agentpool-31130745-vmss000001 kubelet I0417 02:15:33.000796    4782 logs.go:317] Finish parsing log file "/var/log/pods/kube-system_kube-proxy-pqpxg_829a536d-24bd-4603-91b8-a0d4a9492397/kube-proxy/0.log"
Apr 17 10:19:40 mycluster-aks-agentpool-31130745-vmss000000 kubelet I0417 02:19:39.224809    4607 logs.go:317] Finish parsing log file "/var/log/pods/kube-system_kube-proxy-rcph5_33226e53-894c-4153-8f05-2684d7af9043/kube-proxy/0.log"
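
For reference, the Papertrail destination itself is configured through environment variables in the daemonset manifest. A minimal sketch of the relevant env entries, assuming the FLUENT_PAPERTRAIL_HOST and FLUENT_PAPERTRAIL_PORT variable names used by the fluentd-kubernetes-daemonset papertrail images (the values here are placeholders):

- name: FLUENT_PAPERTRAIL_HOST
  value: "logsN.papertrailapp.com"
- name: FLUENT_PAPERTRAIL_PORT
  value: "12345"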

But nothing from my pods was being logged. Furthermore, after five minutes or so the Fluentd pods would crash and restart.

Checking resource usage of pods

I suspected the crashes may have had something to do with the Fluentd pod exceeding memory limits. Running kubectl top pods -n core showed this:

C:\Users\ryan>kubectl top pods -n core
NAME                                                CPU(cores)   MEMORY(bytes)
fluentd-papertrail-tc2fq                            100m         106Mi
fluentd-papertrail-tmjpt                            97m          108Mi

Memory looks fine, but CPU is right at the 100m limit I had defined in the manifest. It looks like Fluentd is doing far more work than it should be.
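
For context, the resources section of my daemonset manifest looked roughly like this (the 100m CPU limit is the relevant part; the memory figures here are illustrative):

resources:
  limits:
    cpu: 100m
    memory: 200Mi
  requests:
    cpu: 100m
    memory: 200Mi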

Backslashes in Fluentd logs

Checking the logs from the Fluentd pods showed a lot of messages composed of a large number of backslashes:

2021-04-10 08:48:00 +0000 [warn]: #0 pattern not match: "2021-04-10T08:47:57.316098541Z stdout F  \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ etc...

Googling this suggested that it’s likely caused by Fluentd tailing its own logs. Checking the Fluentd logs right after a pod restart confirmed this:

2021-04-10 08:47:56 +0000 [info]: #0 following tail of /var/log/containers/fluentd-papertrail-7dzdt_core_fluentd-57857a9c614ff5a34be5a910af6ab0191337f5480f4bcfeba6c5650405d62944.log
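
For reference, the line above came from checking one of the daemonset pods shortly after it started:

C:\Users\ryan>kubectl logs -n core fluentd-papertrail-7dzdt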

Configuring Fluentd to ignore its own logs

To stop Fluentd from tailing its own logs, I learned about the FLUENT_CONTAINER_TAIL_EXCLUDE_PATH environment variable, which is used to exclude certain container logs. So I added this to my daemonset manifest under the env section:

- name: FLUENT_CONTAINER_TAIL_EXCLUDE_PATH
  value: /var/log/containers/fluent*
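
Under the hood, the daemonset image’s kubernetes.conf passes this value to the exclude_path parameter of the in_tail source that reads container logs, so the effective configuration ends up looking roughly like this (a simplified sketch; the real block also includes a parse section):

<source>
  @type tail
  path /var/log/containers/*.log
  exclude_path ["/var/log/containers/fluent*"]
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
</source>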

On redeploying to the cluster, nothing had changed: Fluentd was still tailing logs with a bunch of backslashes. Then I noticed this note in the Fluentd GitHub repository regarding the FLUENT_CONTAINER_TAIL_EXCLUDE_PATH environment variable:

Since v1.9.3 or later images.

Suspecting the Papertrail image might have been created using an earlier version of Fluentd, I changed the Daemonset manifest to use the latest image from the Fluentd project:

image: fluent/fluentd-kubernetes-daemonset:v1.11-debian-papertrail-1

After deploying, things were much quieter! No more logs with backslashes.

However, there were still no logs from my pods coming through in Papertrail, and I started seeing entries like this in the Fluentd logs:

2021-04-17 07:05:41 +0000 [warn]: #0 [in_tail_container_logs] pattern not matched: "2021-04-17T07:05:40.347800353Z stdout F 100.101.102.103 - - [17/Apr/2021:07:05:40 +0000] \"GET / HTTP/2.0\" 200 316 \"https://mywebapp.com\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0\" 59 0.137 [core-mywebapp-api-80] [] 10.240.0.26:3001 316 0.136 200 34d669e9-cd7b-4d2b-8cd4-e1045930c917"

This is a log entry from my web application’s pod, so Fluentd is definitely tailing the logs now, but for some reason it isn’t matching the log format.

Configuring container tail parser type

Whilst I was googling how to prevent Fluentd from tailing its own logs, I had come across a post that mentioned the environment variable FLUENT_CONTAINER_TAIL_PARSER_TYPE. The author also mentioned they were using containerd instead of docker, and containerd is the runtime used by AKS in versions >= 1.20…🤔

After a bit of research it turns out we need to tell Fluentd how to parse containerd logs, and one way is to add this to the Daemonset manifest in the env section:

- name: FLUENT_CONTAINER_TAIL_PARSER_TYPE
  value: /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
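
To see why this is needed: with the Docker runtime the container log files are JSON lines (with log, stream and time fields), which the default json parser handles, but containerd writes plain-text lines in the CRI format, and the regex above splits those apart. Taking the log entry from earlier as an example, the capture groups map like this:

time    → 2021-04-17T07:05:40.347800353Z
stream  → stdout
[^ ]*   → F (the full/partial-line flag, which we don't need to keep)
log     → 100.101.102.103 - - [17/Apr/2021:07:05:40 +0000] "GET / HTTP/2.0" 200 ...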

After deploying this latest change, Papertrail logs started appearing for my pods.

Checking resource usage again

Running kubectl top again to check the resource usage:

C:\Users\ryan>kubectl top pods -n core
NAME                                                CPU(cores)   MEMORY(bytes)
fluentd-papertrail-lq4vx                            6m           109Mi
fluentd-papertrail-mn8qm                            4m           110Mi

That’s looking much healthier: CPU usage is way down from the 100m it was hitting previously. I haven’t seen any crashes either, so hopefully that issue has been resolved, but I’ll keep monitoring for pod crashes.

Summary

  • If you see a bunch of backslashes in the Fluentd logs, it’s probably tailing its own logs, so you’ll want to suppress that. I used the FLUENT_CONTAINER_TAIL_EXCLUDE_PATH environment variable to solve it.
  • If your cluster is running containerd as the container runtime, you need to tell Fluentd how to parse the containerd log format. I used the FLUENT_CONTAINER_TAIL_PARSER_TYPE environment variable to solve it.