This post is dedicated to the requestors: the downstream systems which generate the query traffic our DNS servers see.
Those who come to us with the QR bit and ANCOUNT set to 0. The QR bit is a field in the DNS packet header which signals whether the packet is a query (0) or a response (1). The ANCOUNT field is the number of answer records the packet carries, which is 0 in a query. The majority of the systems we receive requests from as an authoritative DNS operator are recursive name servers operated by Internet Service Providers (ISPs), corporations, or private individuals. We also receive a number of queries from our customers' CDN providers (Content Distribution Networks such as Akamai, CacheFly, and Fastly). These providers are seeking to maintain the most up-to-date name resolution mapping data while cutting out the middle person of the caching layer.
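To make those two header fields concrete, here is a minimal sketch of how the fixed 12-byte DNS header could be decoded with Python's standard library. The function names and the sample packet are made up for illustration; only the header layout (six 16-bit fields, with QR as the top bit of the flags word) comes from the DNS specification.

```python
import struct


def parse_dns_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (six unsigned 16-bit fields)."""
    if len(packet) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", packet[:12]
    )
    return {
        "id": ident,
        "qr": (flags >> 15) & 0x1,  # top bit of the flags: 0 = query, 1 = response
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


def is_query(packet: bytes) -> bool:
    """True when the packet looks like a query: QR and ANCOUNT are both 0."""
    header = parse_dns_header(packet)
    return header["qr"] == 0 and header["ancount"] == 0


# Hypothetical header for illustration: ID 0x1A2B, RD flag set, one question.
sample_query = struct.pack("!6H", 0x1A2B, 0x0100, 1, 0, 0, 0)
print(is_query(sample_query))
```

A real packet continues with the question section after these 12 bytes; the sketch deliberately stops at the header because that is all the two fields require.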
Thinking about the growth of the number of Internet end users and the change in the ratio of end users to devices helps put the challenge of scaling the DNS into perspective. Flash back fifteen to twenty years to the model of a single household computer: sitting on a desk, tethered to a wall for dial-up access and power. Contrast that with the current state of technology where a household or person can have one or more home computers, tablets, smartphones, televisions, game consoles, appliances, Amazon Dash buttons, and more. As the number of devices which need to resolve names grows, so does the complexity of the infrastructure which serves responses.
A finely tuned recursive resolver layer keeps DNS query response times down. Not only do the servers need to process a growing number of queries, they also need to hold larger and larger sets of records in their cache. Recursive providers track this with an operational metric commonly referred to as the cache hit ratio or the cache hit-to-miss ratio. The more cache hits, the less frequently the recursive resolver needs to query the authoritative server, which means one less query blocking a response to the client. This is also a key metric to track if your goal is understanding the end user's perception of performance.
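As a rough illustration of the metric, the sketch below models a toy TTL cache and counts hits and misses. The class and its methods are hypothetical, and it assumes the ratio is defined as hits divided by total lookups; some operators report hits divided by misses instead.

```python
class TtlCache:
    """Toy resolver cache that only tracks when each name expires."""

    def __init__(self):
        self._expires_at = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, name, now, ttl):
        """Return True on a cache hit; on a miss, cache the name for ttl seconds."""
        expiry = self._expires_at.get(name)
        if expiry is not None and now < expiry:
            self.hits += 1
            return True
        # A miss stands in for a blocking query to the authoritative server.
        self.misses += 1
        self._expires_at[name] = now + ttl
        return False

    def hit_ratio(self):
        """Hits as a fraction of all lookups (assumed definition)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = TtlCache()
for second in (0, 10, 20, 70):  # four client lookups against a 60 second TTL
    cache.lookup("www.example.com", now=second, ttl=60)
print(cache.hit_ratio())  # 0.5: lookups at 0 and 70 miss, 10 and 20 hit
```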
One way ISPs and infrastructure providers have handled this scaling challenge is by implementing a pool of resolvers, affectionately referred to as a recursive farm. The behavior of the pool differs by operator. For example, we have observed instances where a single client request which causes a cache miss in the recursive pool triggers three to five queries to our authoritative servers. There are a number of implementation choices which can cause this: each recursive in the pool sends its own query to the authoritative, each recursive sends one query per NS record, recursives are scheduled to re-query popular domains before the TTL on the record expires to keep it fresh in the cache, and so on. The main takeaway here is that recursive resolvers generate a number of different query patterns depending on their configuration and their implementation of caching semantics.
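One of those causes, each recursive in the pool keeping its own cache, can be sketched with a small simulation. This is an illustrative model under stated assumptions rather than a description of any particular operator's farm: client queries for one name are spread round-robin across the pool, and the function name and parameters are invented for the example.

```python
from itertools import cycle


def authoritative_queries(client_query_times, pool_size, ttl, shared_cache):
    """Count upstream queries for one name asked at the given times (seconds).

    shared_cache=True models a farm with a single cache; shared_cache=False
    models independent per-resolver caches behind a round-robin balancer.
    """
    cache_count = 1 if shared_cache else pool_size
    expires_at = [None] * cache_count
    next_resolver = cycle(range(pool_size))
    upstream = 0
    for now in client_query_times:
        idx = 0 if shared_cache else next(next_resolver)
        if expires_at[idx] is None or now >= expires_at[idx]:
            upstream += 1  # cache miss: this resolver asks the authoritative
            expires_at[idx] = now + ttl
    return upstream


times = [0, 1, 2, 3, 4, 5]  # six client queries within one TTL window
print(authoritative_queries(times, pool_size=3, ttl=60, shared_cache=False))  # 3
print(authoritative_queries(times, pool_size=3, ttl=60, shared_cache=True))   # 1
```

With independent caches, the authoritative side sees one query per resolver in the pool even though the clients only asked for one name.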
Another thing we see is query traffic driven by security appliances. Security appliances keep track of the resources associated with the records as end users visit or receive e-mail from domains. For example, if your employer is using one of these devices and you receive an e-mail with a questionable link such as www1[.]bmo.com[.]mhyyjzicptbke[.]hoperk.is-a-financialadvisor.com, the security appliance might look at that DNS name, see .com (a TLD, Top Level Domain) in a position where it shouldn’t be, and trigger some logic that says, “Keep an eye on this.” In this case, the device will continue to issue queries for the domain long after the original query was answered. The frequency of the queries and the length of time the appliance keeps asking seem to vary by appliance or configuration, but it is yet another source of anomalous query activity. These devices keep track of which addresses were associated with a specific domain name to provide context in the event that the domain name is later found to have been nefarious.
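The kind of check described here could look something like the sketch below. It is a guess at the general idea, not the logic of any real appliance: the function name is invented, and `KNOWN_TLDS` is a tiny hypothetical sample where a real implementation would need a complete TLD list.

```python
KNOWN_TLDS = {"com", "net", "org"}  # hypothetical sample, not a full TLD list


def embedded_tld_labels(name: str) -> list:
    """Return (position, label) pairs where a TLD shows up before the last label."""
    # Undo the "[.]" defanging used when sharing suspicious names.
    labels = name.replace("[.]", ".").rstrip(".").lower().split(".")
    # The final label is the real TLD, so only the labels before it are checked.
    return [
        (index, label)
        for index, label in enumerate(labels[:-1])
        if label in KNOWN_TLDS
    ]


suspicious = "www1[.]bmo.com[.]mhyyjzicptbke[.]hoperk.is-a-financialadvisor.com"
print(embedded_tld_labels(suspicious))         # [(2, 'com')]
print(embedded_tld_labels("www.example.com"))  # []
```

A non-empty result would be the trigger for the "keep an eye on this" behavior; what the appliance does next (how often and how long it re-queries) is outside the scope of the sketch.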
Over time, we notice examples of new anomalous behavior as our customers and the Internet ecosystem itself grow and change. For example, around the time when a customer who operates an image-centric blogging platform was in the process of being acquired, we started to see an anomalous number of queries for subdomains of the service. Across our DNS servers as a whole the volume of additional queries per second wasn’t noticeable, but in the geographic region in closest network proximity to a mysterious requestor we saw a massive increase in query volume. The queries were properly formed, which ruled out an unsophisticated attacker. They were also for unique FQDNs which didn’t appear to be algorithmically generated. The DNS-specific metric which flagged anomalous activity for the domain was the number of NXDOMAIN responses, as some of the subdomains being requested were no longer available in the DNS. Other metrics which could offer insight into this behavior were top requestors and top requestors by domain. This new requestor was not a known recursive resolver, CDN, or security device and didn’t match the interaction patterns we had seen. If anything, it looked more like an in-protocol attack than anything else. We were able to reach out to the customer and explain to them that, for this time period, they had an anomalous usage pattern a few standard deviations from the norm. They didn’t seem concerned, so no additional actions were taken.
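The two metrics mentioned in this story, NXDOMAIN volume measured in standard deviations from a baseline and top requestors by domain, can be sketched in a few lines. The function names, the log tuple layout, and all of the numbers below are invented for illustration; the addresses come from the ranges reserved for documentation.

```python
from collections import Counter
from statistics import mean, stdev


def zscore(baseline, observed):
    """How many sample standard deviations `observed` sits from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma


def top_requestors(log, domain, n=3):
    """Rank source addresses by query count for `domain` and its subdomains.

    `log` is assumed to be an iterable of (source_ip, qname, rcode) tuples.
    """
    suffix = "." + domain
    counts = Counter(
        src for src, qname, _rcode in log
        if qname == domain or qname.endswith(suffix)
    )
    return counts.most_common(n)


# Hypothetical NXDOMAIN counts per interval for one domain in one region.
baseline_nxdomain = [100, 110, 90, 105, 95]
print(zscore(baseline_nxdomain, observed=400) > 3)  # True: far outside the norm

log = [
    ("192.0.2.1", "a.example.com", "NXDOMAIN"),
    ("192.0.2.1", "b.example.com", "NXDOMAIN"),
    ("198.51.100.7", "www.example.com", "NOERROR"),
]
print(top_requestors(log, "example.com"))
```

The threshold of 3 is an arbitrary choice for the example; what counts as "a few standard deviations" is a judgment call for each operator.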
As an authoritative DNS provider, looking at the patterns of “who is requesting what” is part of the foundation of operational awareness. Without a baseline for requestor behavior, understanding and differentiating abusive traffic from the norm is impossible. As we observe the activity of different parties and cluster them by different variables, we are able to tease out these profiles. This has led to internal language changes. For example, it's not accurate to assume that the queries we see come from recursive resolvers, so, instead of saying unique resolvers, we now refer to them as unique requestors. This change was the first step in opening up to a new way of thinking about authoritative DNS interaction patterns. Different patterns make more sense as you research the requestors' end goals, or, more simply put, when all you think about is hammers it's easy to treat a screw like a nail.
About the Author
Chris Baker is a Principal Data Analyst at Dyn, a cloud-based Internet Performance company that helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Follow Dyn on Twitter: @Dyn.