<?xml version="1.0" encoding="us-ascii"?>
  <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
  <!-- generated by https://github.com/cabo/kramdown-rfc2629 version 1.0.40 -->

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC1035 SYSTEM "https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1035.xml">
<!ENTITY RFC2181 SYSTEM "https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2181.xml">
<!ENTITY RFC2119 SYSTEM "https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC7719 SYSTEM "https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7719.xml">
]>

<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>

<rfc ipr="trust200902" docName="draft-tale-dnsop-serve-stale-00" category="std">

  <front>
    <title abbrev="DNS Serve Stale">Serving Stale Data to Improve DNS Resiliency</title>

    <author initials="D.C." surname="Lawrence" fullname="David C Lawrence">
      <organization>Akamai Technologies</organization>
      <address>
        <postal>
          <street>150 Broadway</street>
          <city>Cambridge</city>
          <code>MA 02142-1054</code>
          <country>USA</country>
        </postal>
        <email>tale@akamai.com</email>
      </address>
    </author>
    <author initials="W." surname="Kumari" fullname="Warren Kumari">
      <organization>Google</organization>
      <address>
        <postal>
          <street>1600 Amphitheatre Parkway</street>
          <city>Mountain View</city>
          <code>CA 94043</code>
          <country>USA</country>
        </postal>
        <email>warren@kumari.net</email>
      </address>
    </author>

    <date year="2017" month="March"/>

    <area>Internet</area>
    <workgroup>DNSOP Working Group</workgroup>
    <keyword>Internet-Draft</keyword>

    <abstract>


<t>This draft defines a method for recursive resolvers to use stale DNS
data to avoid outages when authoritative nameservers cannot be reached
to refresh expired data.</t>



    </abstract>


    <note title="Ed note">


<t>Text inside square brackets ([]) is additional background
information, answers to frequently asked questions, general musings,
etc.  They will be removed before publication.  This document is being
collaborated on in GitHub at
&lt;https://github.com/vttale/serve-stale&gt;.  The most recent
version of the document, open issues, etc., should all be available
here.  The authors gratefully accept pull requests.</t>


    </note>


  </front>

  <middle>


<section anchor="introduction" title="Introduction">

<t>Traditionally the Time To Live (TTL) of a DNS resource record has been
understood to represent the maximum number of seconds that a record
can be used before it must be discarded, based on its description and
usage in <xref target="RFC1035"/> and clarifications in <xref target="RFC2181"/>.
Specifically, <xref target="RFC1035"/> Section 3.2.1 says that it "specifies the
time interval that the resource record may be cached before the source
of the information should again be consulted".</t>

<t>Notably, the original DNS specification does not say that data past
its expiration cannot be used.  This document proposes a method by
which recursive resolvers handle stale DNS data to balance the
competing needs of resiliency and freshness. It is predicated on the
observation that authoritative server unavailability can cause outages
even when the underlying data those servers would return is typically
unchanged.</t>

<t>There are a number of reasons why an authoritative server may become
unreachable, including Denial of Service (DoS) attacks, network
issues, and so on.  This document suggests that, if the recursive
server is unable to contact the authoritative server but still has
data for the query name, it essentially extends the TTL of the
existing data on the assumption that "stale bread is better than no
bread".</t>

<t>Several major recursive resolver operators currently use stale data
for answers in some way, including Akamai, OpenDNS, and Xerocole.</t>

</section>
<section anchor="terminology" title="Terminology">

<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref target="RFC2119"/>.</t>

<t>For a comprehensive treatment of DNS terms, please see <xref target="RFC7719"/>.</t>

</section>
<section anchor="description" title="Description">

<t>Three notable timers drive considerations for the use of stale data,
as follows:</t>

<t><list style="symbols">
  <t>A client response timer, which is the maximum amount of time a
recursive resolver should allow between the receipt of a resolution
request and sending its response.</t>
  <t>A query resolution timer, which caps the total amount of time a
recursive resolver spends processing the query.</t>
  <t>A maximum stale timer, which caps the amount of time
that records will be kept past their expiration.</t>
</list></t>

<t>Recursive resolvers already have the second timer; the first and
third timers are new concepts for this mechanism.</t>
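<t>As a minimal sketch, the three timers could be grouped into one
configuration object.  The class and field names are hypothetical and
not part of this document.  The defaults use the values suggested in
this section (1.8 seconds and 7 days), with 10 seconds for query
resolution taken from the commonly seen range.  The ordering check
between the first two timers is an assumption made for illustration.</t>

<figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeStaleTimers:
    """Hypothetical container for the three timers, in seconds."""

    # Longest wait before answering the client (new concept).
    client_response: float = 1.8
    # Cap on total work for one query (already in resolvers).
    query_resolution: float = 10.0
    # How long expired records are retained (new concept).
    max_stale: float = 7 * 24 * 60 * 60

    def __post_init__(self) -> None:
        # Assumption: a stale answer is only useful if it can be
        # sent before the resolution attempt is abandoned.
        if not 0 < self.client_response < self.query_resolution:
            raise ValueError("client_response must be positive "
                             "and below query_resolution")
        if self.max_stale < 0:
            raise ValueError("max_stale must not be negative")


timers = ServeStaleTimers()
```
]]></artwork></figure>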

<t>When a request is received by the recursive resolver, it SHOULD start
the client response timer.  This timer is used to avoid client
timeouts.  It SHOULD be configurable, with a recommended value of 1.8
seconds.</t>

<t>The resolver then checks its cache for an unexpired answer. If it
finds none and the Recursion Desired flag is not set in the request,
it SHOULD immediately return the response without consulting the
cache for expired records.</t>
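<t>The initial cache check might be sketched as follows.  The cache
layout, the function name, and the symbolic return values are
assumptions for illustration only; the content of the response sent
when recursion is not desired is left out.</t>

<figure><artwork><![CDATA[
```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    rdata: List[str]
    expires_at: float  # absolute time at which the TTL runs out


Key = Tuple[str, str]  # (owner name, record type)


def initial_cache_check(cache: Dict[Key, Entry], key: Key,
                        rd_flag: bool, now: float
                        ) -> Tuple[str, Optional[List[str]]]:
    """Decide what to do right after the request arrives."""
    entry = cache.get(key)
    if entry is not None and now < entry.expires_at:
        return "answer", entry.rdata  # unexpired hit
    if not rd_flag:
        # Recursion not desired: respond at once and do not
        # fall back to expired records.
        return "respond_without_stale", None
    return "resolve", None  # start iterative lookups
```
]]></artwork></figure>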

<t>If iterative lookups will be done, the resolver SHOULD start the
query resolution timer.  This timer bounds the work done by the
resolver, and is commonly around 10 to 30 seconds. [ BIND 9 used to use a
hard-coded constant of 30 seconds and has more recently added a
configuration parameter that defaults to 10 seconds and is capped at
30. A rigorous exploration of other implementations has not yet been
done. ]</t>

<t>If the answer has not been completely determined by the time the
client response timer has elapsed, the resolver SHOULD then check its
cache to see whether there is expired data that would satisfy the
request.  If so, it adds that data to the response message and SHOULD
set the TTL of each expired record in the message to 1 second.  [
This 1 second TTL is ripe for discussion. ] The response is then sent
to the client while the resolver continues its attempt to refresh the
data.</t>
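<t>A minimal sketch of building the stale answer once the client
response timer has elapsed is shown here.  The record types and the
function name are hypothetical; only the 1 second TTL for expired
records comes from the text above.</t>

<figure><artwork><![CDATA[
```python
from typing import List, NamedTuple, Optional

STALE_TTL = 1  # seconds, as suggested in the text above


class Record(NamedTuple):
    name: str
    rtype: str
    rdata: str
    expires_at: float  # absolute expiry time


class AnswerRR(NamedTuple):
    name: str
    rtype: str
    ttl: int
    rdata: str


def stale_answer(cached: List[Record], now: float
                 ) -> Optional[List[AnswerRR]]:
    """Build the answer section once the client timer fires."""
    if not cached:
        return None  # nothing stale to offer; keep resolving
    answer = []
    for rr in cached:
        remaining = int(rr.expires_at - now)
        # Expired records go out with the short stale TTL.
        ttl = remaining if remaining > 0 else STALE_TTL
        answer.append(AnswerRR(rr.name, rr.rtype, ttl, rr.rdata))
    return answer
```
]]></artwork></figure>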

<t>The maximum stale timer is used for cache management and is
independent of the query resolution process. This timer is
conceptually different from the maximum cache TTL that exists in many
resolvers, the latter being a clamp on the value of TTLs as received
from authoritative servers.  The maximum stale timer SHOULD be
configurable, and defines the length of time after a record expires
that it SHOULD be retained in the cache.  The suggested value is 7
days, which gives operators time to notice the resolution problem and
intervene to fix it.</t>
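<t>The distinction between the two limits can be sketched as follows.
The function names are hypothetical, and treating a record that is
exactly at the maximum stale limit as still retained is an
assumption.</t>

<figure><artwork><![CDATA[
```python
MAX_STALE = 7 * 24 * 60 * 60  # suggested value: 7 days, in seconds


def classify(expires_at: float, now: float,
             max_stale: float = MAX_STALE) -> str:
    """Return 'fresh', 'stale', or 'discard' for a cached record."""
    if now < expires_at:
        return "fresh"
    if now - expires_at <= max_stale:
        return "stale"    # expired, but still retained
    return "discard"      # past the maximum stale timer


def clamp_ttl(ttl: int, max_cache_ttl: int) -> int:
    """The maximum cache TTL only limits TTLs as received."""
    return min(ttl, max_cache_ttl)
```
]]></artwork></figure>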

<t>This same basic technique MAY be used to handle stale data associated
with delegations.  If authoritative server addresses cannot be
refreshed, resolution may still succeed if the authoritative servers
themselves are still up.</t>

</section>
<section anchor="implementation-caveats" title="Implementation Caveats">

<t>Answers from authoritative servers that have a DNS Response Code of
either 0 (NOERROR) or 3 (NXDOMAIN) MUST be considered to have
refreshed the data at the resolver.  In particular, this means that
this method is not meant to protect against operator error at the
authoritative server that turns a name that is intended to be valid
into one that is non-existent, because there is no way for a resolver
to know intent.</t>
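<t>The refresh rule can be expressed as a small predicate.  The
function name is hypothetical; the numeric values are the standard
DNS response codes.</t>

<figure><artwork><![CDATA[
```python
NOERROR = 0
SERVFAIL = 2
NXDOMAIN = 3


def refreshes_cache(rcode: int) -> bool:
    """True if an authoritative answer replaces cached data."""
    # NXDOMAIN counts as a refresh: the resolver cannot tell an
    # intended removal from an operator error.
    return rcode in (NOERROR, NXDOMAIN)
```
]]></artwork></figure>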

<t>To adhere to the original intent of the design of the DNS,
resolution is given a chance to succeed before stale data is used.  This
mechanism is only intended to add robustness to failures, and to be
enabled all the time.  If stale data were used immediately and then a
cache refresh attempted after the client response has been sent, the
resolver would frequently be sending data that it would have had no
trouble refreshing.</t>

<t>It is important to continue the resolution attempt after the stale
response has been sent, until the query resolution timeout, because
some pathological resolutions can take many seconds to succeed as they
cope with unavailable servers, bad networks, and other problems.
Stopping the resolution attempt when the response with expired data
has been sent would mean that answers in these pathological cases
would never be refreshed.</t>
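<t>One way to keep the resolution attempt running after the stale
response has been sent is sketched below with Python's asyncio.  The
function and parameter names are hypothetical, the lookup is a
stand-in coroutine, and error handling other than the query
resolution timeout is omitted.</t>

<figure><artwork><![CDATA[
```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


async def answer_query(
    resolve: Callable[[], Awaitable[str]],
    stale: Optional[str],
    client_timeout: float,
    resolution_timeout: float,
    update_cache: Callable[[str], None],
) -> Tuple[Optional[str], asyncio.Task]:
    """Return (answer to send now, refresh task)."""

    async def bounded_resolve() -> Optional[str]:
        try:
            # The query resolution timer bounds the whole attempt.
            fresh = await asyncio.wait_for(resolve(),
                                           resolution_timeout)
        except asyncio.TimeoutError:
            return None
        update_cache(fresh)
        return fresh

    task = asyncio.create_task(bounded_resolve())
    # Wait only as long as the client response timer allows.
    # asyncio.wait() does not cancel the task on timeout.
    done, _ = await asyncio.wait({task}, timeout=client_timeout)
    if done:
        return task.result(), task
    # Timer elapsed: answer from stale data (possibly None) while
    # the refresh keeps running in the background.
    return stale, task


async def demo() -> Tuple[Optional[str], list]:
    refreshed: list = []

    async def slow_lookup() -> str:
        await asyncio.sleep(0.05)  # slower than the client timer
        return "fresh-data"

    sent, task = await answer_query(
        slow_lookup, "stale-data", 0.01, 1.0, refreshed.append)
    await task  # let the background refresh finish
    return sent, refreshed


sent, refreshed = asyncio.run(demo())
```
]]></artwork></figure>

<t>In the demonstration the client receives the stale value because
the lookup outlasts the client response timer, and the cache is still
updated when the lookup completes within the query resolution
timer.</t>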

<t>Canonical Name (CNAME) records mingled in the expired cache with other
records at the same owner name can cause surprising results.  This was
observed with an initial implementation in BIND, where a hostname
changed from having a CNAME record to an IPv4 Address (A) record.
BIND does not evict CNAMEs in the cache when other types are received,
which in normal operations is not an issue.  However, after both
records expired and the authorities became unavailable, the fallback
to stale answers returned the older CNAME instead of the newer A.</t>

<t>[ This probably applies to other occluding types, so more thought
should be given to the overall issue. It should probably also be
rewritten to not suggest that this only a quirk of BIND. ]</t>

<t>Keeping records around after their normal expiration will of course
cause caches to grow larger than if records were removed at their TTL.
Specific guidance on managing cache sizes is outside the scope of this
document.  Some areas for consideration include whether to track the
popularity of names in client requests versus evicting by maximum age,
and whether to provide a feature for manually flushing only stale
records.</t>
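<t>As an illustration of the last point, a manual flush limited to
stale records could look like the following sketch.  The cache is
reduced to a hypothetical mapping from key to expiry time.</t>

<figure><artwork><![CDATA[
```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (owner name, record type)


def flush_stale(expiry: Dict[Key, float], now: float) -> int:
    """Remove only expired entries; return how many were dropped."""
    stale_keys = [k for k, expires_at in expiry.items()
                  if expires_at <= now]
    for k in stale_keys:
        del expiry[k]
    return len(stale_keys)
```
]]></artwork></figure>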

</section>
<section anchor="implementation-status" title="Implementation Status">

<t>[RFC Editor: per RFC 6982 this section should be removed prior to
publication.]</t>

<t>The algorithm described in this draft was originally implemented as a
patch to BIND 9.7.0.  It has been in production on Akamai's
network since 2011, and has effectively smoothed over transient failures
and longer outages that would otherwise have resulted in major incidents. The
patch has been contributed to the Internet Systems Consortium in
anticipation that it will be incorporated into its main BIND
distribution.</t>

</section>
<section anchor="security-considerations" title="Security Considerations">

<t>The most obvious security issue is the increased likelihood of DNSSEC
validation failures when using stale data because signatures could be
returned outside their validity period.  This would only be an issue
if the authoritative servers are unreachable, which is the only time
the techniques in this document are used, and thus it does not introduce
a new failure in place of what would otherwise have been success.</t>

<t>Additionally, bad actors have been known to use DNS caches to keep
records alive even after their authorities have gone away.  This makes
that easier.</t>

</section>
<section anchor="privacy-considerations" title="Privacy Considerations">

<t>This document does not add any practical new privacy issues.</t>

</section>
<section anchor="nat-considerations" title="NAT Considerations">

<t>The method described here is not affected by the use of NAT devices.</t>

</section>
<section anchor="iana-considerations" title="IANA Considerations">

<t>This document contains no actions for IANA.  This section will be
removed during conversion into an RFC by the RFC editor.</t>

</section>
<section anchor="acknowledgements" title="Acknowledgements">

<t>The authors wish to thank Matti Klock, Mukund Sivaraman, Jean Roy, and
Jason Moreau for initial review.</t>

</section>


  </middle>

  <back>

    <references title='Normative References'>

&RFC1035;
&RFC2181;
&RFC2119;


    </references>

    <references title='Informative References'>

&RFC7719;


    </references>



  </back>


</rfc>

