Build It and They Will Come
During the Kong Summit in September, Dennis Kelly, Senior DevOps Engineer, explained how Kong became a core service, and an integral part of the architecture, across brands at Zillow Group. Starting out with a single use case for Kong Community Edition, Zillow advanced to proxying production workloads at scale with Enterprise Edition, automating deployments with Terraform. Kong’s power and flexibility fueled its explosive adoption at Zillow. This talk will give you the tools to set up your own enterprise-ready Kong clusters in Amazon Web Services (AWS) with minimal time and effort by leveraging Infrastructure as Code (IaC), creating a field of dreams for building your products.
See More Kong Summit Talks
Sign up to receive updates and watch the presentations on the Kong Summit page. We’d love to see you in 2019!
Full Transcript
I’m Dennis Kelly, a Senior DevOps Engineer for Zillow and a lead on our API strategies. I’m also an AWS Certified Solutions Architect Professional, so I have a little bit of experience with AWS as well.
Today, I’d like to talk about a number of different things. I was looking at our story at Zillow Group and how Kong has evolved, and it really came down to this: we built Kong, and then we had just this explosive adoption because Kong was just there. It was like the movie “Field of Dreams”: you build the field and they will come.
And so today, the agenda is: we’re going to give a brief introduction to Kong and what was attractive to us about it. We’ll talk about Kong at Zillow Group and the evolution there, go over the architecture that we use in AWS, introduce infrastructure as code with Terraform and how we deploy our clusters, then close with some thank you’s and some time for Q&A.
So what is Kong? Everyone knows hopefully by now, throughout the Kong Summit, that it’s an API gateway. It really is just a proxy between clients and your API. Think about going to the bar with your friend; it’s your local bar, and he’s coming in with you. It’s like, “Oh, let’s order some Manhattans.” I’m like, “No, wait, I got this, bro,” because you know the bartender. So the bartender is our microservice on the back end. You’re the client wanting to request something. I give him the wink. He comes over, bypasses the beautiful women that are also waiting in line for a drink. I’m like, “I need two Manhattans, straight up with a twist.” You could have ordered that yourself, but you may have been waiting a little bit longer, and you may not have gotten the response that you wanted from the bartender. That guy in the middle facilitated the request and gave us some quality of service.
The beauty of Kong is its extensible API. We can add a lot of functionality there, as utility features in the microservice architecture. If you look at the server itself, it’s built on Nginx and OpenResty, and then the Kong server on top. We’ll go into a little bit of detail here about what that is. Nginx is an extremely powerful, very high-performance web server; it powers over 400 million websites. If you look at that as an open source project itself and the community behind it, it’s very attractive. OpenResty, which integrates Nginx with the LuaJIT just-in-time compiler, basically provides you an easy-to-use and scalable dynamic web gateway. That’s what Kong builds itself on top of. And then with that, Kong has its own plugin architecture where you can also extend its functionality. It’s highly extensible, scalable, and RESTful, making it a great pattern for infrastructure as code, and also platform agnostic, which is a great benefit for us given the different types of architectures that we use.
So Kong came into the picture at Zillow Group when we were looking at sharing APIs between our different brands. Zillow Group is actually composed of brands like Zillow, Trulia, HotPads, and StreetEasy; you can see them down there at the bottom of the slide. We have these development groups wanting to come in and share their APIs, and they’re already amped up and ready to go: “Oh, let’s just set up a VPN tunnel between our two data centers and then we’ll start sharing that API.” Then the next group comes along with “VPC peering,” and another with “Oh, we already set up this as a public API.”
You can see the headaches already starting to form for the operations teams. It’s like, “Okay, let’s pump the brakes here for a second.” Those are obviously old and busted ways. They’re not going to be a consistent pattern, not scalable for the future, not secure. So we came up with some tenets for what I call the new hotness. We wanted to build a service that could be consumed by all of our different brands; when we looked at all the different architectures and data centers we had, we needed something that would work in each one of those.
We wanted it to be consistent and secure for the microservices as well. And we were looking for something that was standards based, with quick and easy onboarding. I think that really translated into this needing to be completely transparent to our development teams, because we didn’t want them to go back in and have to refactor a ton of stuff in order for Kong to work. That was, again, one of the big attractions to Kong: we could abstract a lot of that stuff into Kong, unify a lot of the functionality in one spot, and then not have to be dealing with, “Oh, we found a security bug in this utility microservice communication package that we’ve built.” Trying to get teams to upgrade that in a consistent way would be a nightmare.
So this is where Kong came aboard. Working with teams down at Trulia, we came up with this architecture for sharing our microservices using Kong. At Zillow, we have these things called brain dumps. They happen every Tuesday, where you’re introducing new concepts and new services to the company. I presented on Kong on Tuesday, August 15th, 2017, a little over a year ago. All of a sudden, that’s when Kong blew up. Week 2, I had meetings booked out for weeks in advance, basically taking up three quarters of my time. I had 40 Jira tickets and two weeks of people requesting Kong. It was pretty overwhelming. Thank God for PMs, right?
As Kong was hitting the water cooler talk, it was starting to gain a lot of momentum. It was like, “Okay, we’re sharing APIs between these brands, and we hear about all these other cool features of Kong.” And so it was like:
“Can I? Can we do public APIs?” Well, yes. Yes, you can.
“We want to do some CORS with that as well.” Yes. Yes, you can.
“Rate limiting?” Yes, you can.
And so then all of a sudden, playing in the back of my mind was that song by A Tribe Called Quest, “Can I Kick It?” Yes, you can.
“East-west authentication?” Yes, you can.
And now I’m feeling like the Kong guru at Zillow. It’s like, “Can I?” Yes, you can. I just wanted to have the tape in a big old boom box, ready to go for any time someone came up to me with “Can I do this with Kong?” The last one was Lambda. Yes, you can. So can you kick it? Can you Kong it? You absolutely can.
So I’m going to get into a couple of specific use cases that I introduced on that last slide. One was our east-west authentication. When we think of a Kong API gateway, a lot of that is north-south, which is basically data coming in and out of your data center. East-west is the traffic within it. We had a specific service, an email subscription service that manages a large number of campaigns and people’s subscriptions to those campaigns, and they were definitely concerned about the pattern of: “Oh, hey, I need access to this. I’m going to go look at this other service, copy and paste code from it, and then my stuff is up and running.” All of a sudden you have these inherent consumers that you’ve onboarded without knowing about it. Because email can be such a tricky and very spammy thing, they didn’t want just anyone having access to it, and they really wanted to control access within that service down to specific endpoints.
So if I’m creating a campaign for my specific microservice, you’re going to be limited to the scope of that particular campaign. They came to us with this potential opportunity. We decided that we’d create an API endpoint for each service route. We then had a one-to-one relationship between each API endpoint and a whitelist group. For each of the microservice consumers that we onboarded, we created a consumer and added them to each of the groups for the endpoints they needed access to. And then, for the API keys, we use our own version of Vault for escrowing those values, so the service owners themselves don’t even have access to them. Those get instantiated on deploy of the application.
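To make that pattern concrete, here’s a rough sketch of the per-route endpoint, whitelist group, and consumer setup expressed with the community Terraform provider for Kong (which comes up later in this talk). The resource and attribute names follow that provider and may differ across versions, and every service and group name below is made up for illustration; this is a sketch of the pattern, not our production configuration.

```hcl
# Illustrative sketch using the community Kong Terraform provider;
# resource and attribute names may differ by provider version.

# One endpoint per service route.
resource "kong_service" "campaigns" {
  name     = "email-campaigns"
  protocol = "http"
  host     = "email-svc.internal"   # hypothetical upstream
  port     = 8080
}

resource "kong_route" "campaigns" {
  service_id = kong_service.campaigns.id
  protocols  = ["https"]
  paths      = ["/campaigns"]
}

# Require an API key on the endpoint.
resource "kong_plugin" "campaigns_key_auth" {
  name       = "key-auth"
  service_id = kong_service.campaigns.id
}

# One whitelist group per endpoint (the one-to-one relationship).
resource "kong_plugin" "campaigns_acl" {
  name        = "acl"
  service_id  = kong_service.campaigns.id
  config_json = jsonencode({ whitelist = ["campaigns-consumers"] })
}

# One consumer per onboarded microservice, added to the group it needs.
resource "kong_consumer" "listing_service" {
  username = "listing-service"
}

# Group membership attached via the provider's consumer plugin config
# (shown illustratively; check your provider version for the exact resource).
resource "kong_consumer_plugin_config" "listing_acl" {
  consumer_id = kong_consumer.listing_service.id
  plugin_name = "acls"
  config_json = jsonencode({ group = "campaigns-consumers" })
}
```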
And then came along caching. We had an old service that was struggling with the current load of our website. It was a service that was hit for every home detail page. Basically, when you go to Zillow.com and you’re looking at a specific property, there were property attributes being loaded from this service, and it just couldn’t keep up with the current load. It was an older service tied to SQL Server, and they were thinking about browning out the service until they could build the replacement using DynamoDB. Then they came to us: “Let’s do some caching.” Initially it was Squid and Varnish, and our ops team was like, “We don’t want to get into maintaining this,” because when the development teams come to you and say six months, yeah, it’s going to be done in a year.
This was at the time we were starting to evaluate Kong Enterprise, because we were really starting to ramp up our workloads to enterprise levels and it was becoming a core service. So not only from the support perspective, but also looking at this caching plugin, we went into an evaluation of Kong Enterprise and found that it was going to be a great solution for us, because we’re not introducing new technology that we’d have to maintain. We already had that Kong infrastructure. We’d already built that field of dreams, so onboarding this was very easy. We also looked at the complications of caching with other solutions. Having Redis as a backend means we’re warming the cache for every single node at the same time, and not having to do that on an individual-instance basis was a really powerful advantage for us.
And so this is when Kong Enterprise went into production. We looked at the amount of data that we wanted to cache in order for the service to be healthy and sized our backend Redis appropriately. We were getting about a 70% hit rate on the cache, and it brought our average latency down from 25 milliseconds to 4. We were really impressed with that, having not originally implemented Kong as a caching solution at all.
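For flavor, here’s what enabling that caching might look like declaratively. This assumes the Kong Enterprise proxy-cache-advanced plugin with a Redis strategy; the plugin name and config fields are taken from Kong Enterprise documentation and may differ by version, and the resource names are illustrative.

```hcl
# Hedged sketch: Enterprise proxy caching backed by a shared Redis,
# so every node reads the same warmed cache. Names are illustrative.
resource "kong_plugin" "property_cache" {
  name       = "proxy-cache-advanced"            # Enterprise plugin, assumed
  service_id = kong_service.property_attributes.id  # hypothetical service

  config_json = jsonencode({
    strategy     = "redis"
    cache_ttl    = 300                     # seconds to keep a cached response
    content_type = ["application/json"]
    redis = {
      host = "kong-cache.example.internal" # e.g. the ElastiCache endpoint
      port = 6379
    }
  })
}
```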
Looking back at the factors of our success, obviously Kong played a big part, but then we have the Zillow Group core values: we move fast, we think big, ZG is a team sport, and we own it.
It was really great to see a lot of the different brands come together, embrace this idea, collaborate, and build this solution with me. Along with that, it was very complementary to a lot of the DevOps principles: we partner with our customers, our development teams, for success; we automate, automate, automate as much as we can; we make things self-service; and we do things in a way that allows us to iterate quickly.
And then again, the power and flexibility of Kong just really opened up a lot of doors for us, and Kong just being there caused people to think differently about how we were doing things. Lastly, there were a lot of features of AWS that we took advantage of in order to scale out to enterprise workloads, and we ended up leveraging a lot of their best practices. In terms of high availability, in each region they offer multiple availability zones, and these are basically separate data centers within a geographic region that give you redundancy at every level. It’s very important to leverage multiple AZs because, if you’ve ever used AWS, some of those go down sometimes.
Then there’s the ability to elastically scale. I think a lot of people, when they think of scaling, think it’s just upward and onward: I’m only ever going to be adding more. One of the important tenets in the cloud is that if you really want to see that AWS savings at the end of the month, you also need to be able to scale down, and it’s a really important practice to implement. Otherwise, at the end of the month, it’s like, “Why am I spending a million dollars on this? I thought this was going to be cheaper?” It’s like, “No, you need to scale both ways.” It’s just like those sweatpants that you put on: I’ve been eating well at this conference all week, drinking free drinks, so I’m going to expand those sweatpants out, but when I get home and start working out again, they need to still fit and not fall off my ass.
So scaling up and down, and also scaling horizontally and vertically. When you’re looking at the database instances and the EC2 instances that you’re using, you want to be able to increase the size of those instances, and you also want to be able to add more instances, to scale in both directions. AWS has a lot of tools out there that help us with the automation process. Then security: even though it’s the last item on the slide, it should never be the last thought. It should always be an integral part of anything that you do. Also realize that in AWS it’s a shared responsibility between you and AWS: you should be leveraging a least-privilege model where you only introduce permissions and access as they’re needed, and using security as code will allow you to make sure your policies are enforced.
And so we looked at our AWS resources for Kong. We went with PostgreSQL because we just have in-house experience with it and we’re very comfortable with it. We didn’t have any Cassandra before, and for the way that we wanted to manage and scale Kong, PostgreSQL was the right fit. We went with Aurora because, again, of the enterprise aspects. If you look at RDS versus Aurora, you’re getting multi-AZ, clustered, managed, auto-patched, automated backups: a lot of enterprise features that will help you withstand a disaster.
Then, because we were using rate limiting and also the caching, we wanted a Redis backend, and we used ElastiCache for that: again, a managed service that scales out and uses primary and replica technologies. And then EC2 Auto Scaling: as we added new nodes to the Kong cluster, or there was, say, a hardware failure in AWS, we wanted to be able to replace those nodes or add new nodes as needed.
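As a rough sketch, and assuming illustrative names and sizes rather than our production values, those two managed stores might be declared like this:

```hcl
# Minimal sketch of the managed backing stores; identifiers, instance
# classes, and counts are illustrative.
variable "db_master_password" {}

# Aurora PostgreSQL cluster for Kong's datastore.
resource "aws_rds_cluster" "kong" {
  cluster_identifier      = "kong"
  engine                  = "aurora-postgresql"
  master_username         = "kong"
  master_password         = var.db_master_password
  backup_retention_period = 7
}

resource "aws_rds_cluster_instance" "kong" {
  count              = 2                          # a writer plus a reader
  identifier         = "kong-${count.index}"
  cluster_identifier = aws_rds_cluster.kong.id
  instance_class     = "db.r4.large"
  engine             = "aurora-postgresql"
}

# ElastiCache Redis with automatic failover, for rate limiting and caching.
resource "aws_elasticache_replication_group" "kong" {
  replication_group_id          = "kong"
  replication_group_description = "Kong rate limiting and proxy cache"
  engine                        = "redis"
  node_type                     = "cache.m4.large"
  number_cache_clusters         = 2               # primary plus one replica
  automatic_failover_enabled    = true
}
```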
I think one really cool thing that’s often overlooked in AWS is the EC2 Parameter Store. It’s a great key-value, secure-string service that you can use to protect your data. We actually use it for our database passwords, API keys, and a lot of the sensitive information that we don’t want sitting out there in a repository or in our Terraform state files.
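Here’s a minimal sketch of how Terraform can read those secure strings at apply time instead of hard-coding them in .tf files; the parameter paths are hypothetical:

```hcl
# Pull secrets from the Parameter Store at apply time rather than
# committing them to the repository. Parameter names are hypothetical.
data "aws_ssm_parameter" "db_password" {
  name = "/kong/prod/db-password"   # a SecureString, decrypted on read
}

data "aws_ssm_parameter" "admin_token" {
  name = "/kong/prod/admin-token"
}

# Reference the decrypted values where they're needed, for example as the
# master_password of the Aurora cluster:
#   master_password = data.aws_ssm_parameter.db_password.value
```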
Another important piece is Elastic Load Balancing: have that in multiple AZs to protect your services and scale out. Then CloudWatch: you need to be able to monitor and alert on the health of your services. We used IAM for our instance profiles on the Kong nodes, so that they can reach out and get access to various things like the EC2 Parameter Store. That way we’re not embedding keys in any of the nodes.
And then again, with security groups, it’s a least-privilege model. Everything that we did in AWS was really locked down to the specific things that needed access to it. Our load balancers can talk to a Kong node; nothing else can talk to a Kong node. You can talk to the load balancer, Kong nodes can talk to the database, and everything is very secure and locked down.
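A minimal sketch of that chain, with illustrative names and ports: each security group only admits traffic from the group one hop upstream.

```hcl
# Least-privilege chain: internet -> LB -> Kong -> database.
variable "vpc_id" {}

resource "aws_security_group" "lb" {
  name   = "kong-lb"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]       # public HTTPS into the load balancer
  }
}

resource "aws_security_group" "kong" {
  name   = "kong-node"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 8000
    to_port         = 8000
    protocol        = "tcp"
    security_groups = [aws_security_group.lb.id]   # only the LB reaches Kong
  }
}

resource "aws_security_group" "db" {
  name   = "kong-db"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.kong.id] # only Kong reaches Postgres
  }
}
```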
And so this is what our architecture looked like, and still looks like. Right in the middle, you’ll see a Kong cluster that has, again, the Auto Scaling group, where the nodes in the cluster can scale out and scale back in as needed. We have an external load balancer to accept connections from the public internet and from other Kong clusters in different data centers, and we only expose that via SSL.
Internally, we have a load balancer that can do both HTTP and HTTPS, depending on the service. Then for the Admin GUI functionality in the Enterprise Edition, you can also access that via SSL, and the Admin API as well. The way we designed this was that we wanted it to be completely transparent for the microservices. Our consumers actually hit a Kong endpoint in their local VPC that adds the API key for them and forwards the request to the remote Kong cluster, which validates the API key as that consumer and then forwards it on to the microservice. That way, we in operations can seamlessly do key rotation without having to impact a service deploy or have developers reconfigure things.
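A hedged sketch of that local hop: a route on the local cluster that forwards to the remote cluster, with the request-transformer plugin injecting the escrowed key. Hostnames, paths, and the parameter holding the key are all hypothetical, and the resource names follow the community Kong provider.

```hcl
# The escrowed key never lives in a developer's config; it's read from the
# Parameter Store and injected by the local Kong cluster. Names are made up.
data "aws_ssm_parameter" "campaigns_key" {
  name = "/kong/keys/campaigns"
}

resource "kong_service" "remote_campaigns" {
  name     = "remote-email-campaigns"
  protocol = "https"
  host     = "kong.remote-brand.example.com"   # the remote Kong cluster
  port     = 443
}

resource "kong_route" "local_campaigns" {
  service_id = kong_service.remote_campaigns.id
  protocols  = ["http", "https"]
  paths      = ["/campaigns"]
}

# request-transformer adds the API key before the request leaves the VPC.
resource "kong_plugin" "inject_key" {
  name     = "request-transformer"
  route_id = kong_route.local_campaigns.id
  config_json = jsonencode({
    add = {
      headers = ["apikey:${data.aws_ssm_parameter.campaigns_key.value}"]
    }
  })
}
```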
So in terms of provisioning resources, infrastructure as code is the new buzzword, and it’s a great buzzword in terms of what it actually implies. So what is it? It’s just machine-readable configuration of your data center. This can be a script or a declarative definition of what you want your infrastructure to look like and how it should behave. The benefit of doing it this way is that you can then version it: you can put it in a repository, iterate on it, and revert back to a previous state as needed. It’s also shareable, and that was very valuable to us at Zillow because we have multiple brands, multiple DevOps teams, and multiple people provisioning. It’s reusable; that way, we didn’t have to reinvent the wheel at each step, and we were always using the same code together. And it’s repeatable, because even within a DevOps group we were deploying multiple Kong clusters, and we needed a way to ramp up quickly to meet the demands of our customers.
And so at Zillow, Terraform was our way of doing that. There are obviously other tools out there for AWS, but we had already standardized on Terraform, and so that’s how we did Kong. Terraform is just a tool that codifies APIs into declarative configurations, and you can go to the website for more. Some of the additional benefits of Terraform are that it’s open source, just like Kong, and has a great community behind it. Another benefit is that it can really manage your complex change sets. If you’re looking at introducing, say, a new resource or modifying an existing resource, Terraform can go look at what already exists in your VPC, compare that to the changes you want to apply, and only make those changes. It can also manage resource dependencies. So if you have a security group that depends on another security group, or an EC2 instance that’s going to depend on a database, Terraform can help you easily manage those dependencies to make sure that resources are created, modified, and destroyed in the appropriate order.
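A small sketch of what that dependency management looks like in practice, with illustrative resources: Terraform infers ordering from attribute references, and depends_on makes an ordering explicit when there is no reference.

```hcl
# Terraform orders these automatically: the instance references the
# database's address, so the database is created first.
variable "db_password" {}
variable "ami_id" {}

resource "aws_db_instance" "example" {
  identifier        = "example"
  engine            = "postgres"
  instance_class    = "db.t2.small"
  allocated_storage = 20
  username          = "app"
  password          = var.db_password
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t2.small"

  # Implicit dependency via the attribute reference below.
  user_data = "DB_HOST=${aws_db_instance.example.address}"

  # Explicit dependency, useful when there's no attribute to reference.
  depends_on = [aws_db_instance.example]
}
```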
So getting started with Terraform: you can just go to their website and download it for free. It’s available for a number of different platforms, and you can use Homebrew on macOS to get started with it as well. You just unzip the binary and place it into a folder that’s in your path; here are some instructions to help you do that. Super easy to do. Then you just verify your installation. Here, I’m on my system called Awesome (I have another one called Booya): terraform --version will show you what you have currently installed. It’s actually important to take notice of this, because Terraform is a project that iterates really quickly. They release new versions all the time, and it’s important to stay up to date. You may find, when you start collaborating with other people, that if they’ve downloaded a version after you and have something more up to date, you’re going to have to upgrade in order to modify the state.
Then to actually start using Terraform, there are a number of different providers that you can use to describe, provision, change, and destroy resources in your environment. There are actually over 70 officially supported providers, AWS being one of those; it’s been brought up in other talks as well. There are some community providers too, Kong being one of those. So it’s a very flexible and powerful tool.
Terraform configuration is basically just text files. There is a Terraform format with a .tf extension, and this is actually the preferred way to describe your infrastructure because it’s human readable and you can add comments to it. The declarative format for that is HCL, the HashiCorp Configuration Language. You can actually also do it in JSON, with a .tf.json extension, but that’s really designed for applications that would be generating the Terraform for you to apply. Again, recognize the limitations of JSON: it’s not as human readable, and you’re not going to have the comments.
Configuration semantics for Terraform: it basically looks at all the files in your directory, orders them alphabetically, and merges them all together. So if you, say, create a definition of a resource in multiple files, you’ll actually get an error on that merge because of the multiple declarations. There is a pattern for overriding; I’ll list that here for further reading on your part, but I’m not going to go into details, just so we can focus on Kong. So yes, Terraform is declarative: the order of references and variables within the files doesn’t matter, and it’s going to merge them all for you. Some basics: typically you’ll lay out a directory with a main.tf, and this is where you specify your provider. This could be the AWS one, where you’re going to give it some credentials, or it could be the Kong one, where you give it your Admin API token.
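A typical main.tf might look something like this; the Kong provider’s attribute names follow the community provider and may vary by version, and all the endpoints are placeholders.

```hcl
# main.tf: the providers the rest of the directory will use.
provider "aws" {
  region = "us-west-2"
}

provider "kong" {
  kong_admin_uri = "https://kong-admin.internal.example.com"  # placeholder
  # With Enterprise RBAC, the admin token is usually supplied through an
  # environment variable or a variable rather than hard-coded, e.g.:
  # kong_admin_token = var.kong_admin_token
}
```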
Then you define resources, in basically any file name that you want; typically it would be redis.tf or aurora.tf, describing the resource that you’re trying to provision. A resource is basically just a component of infrastructure, and you can actually have multiple definitions within one file. A data source, typically stored in data.tf, references an existing piece of infrastructure that wasn’t created within your Terraform directory. It may have been created by someone else in their Terraform directory, but instead of having to statically reference, say, a resource ID like an ARN, you can use the data source to pull in that information for you. That way, if you were to, say, rebuild a VPC and get a new ARN, you’re not having to update those static references in each place.
And then variables are basically just parameters that you can specify. Think of any other programming language; even though HCL is not a programming language, it’s the same principle. And then lastly, modules: these are basically a Terraform directory of resources encapsulated into one reusable group.
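Here’s a compact, illustrative example of all three building blocks together: a variable, a data source that finds existing subnets by tag, and a module call. All names are made up.

```hcl
variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  default     = "dev"
}

variable "vpc_id" {}

# Look up the private subnets someone else provisioned, by their Type tag,
# instead of hard-coding subnet IDs.
data "aws_subnet_ids" "private" {
  vpc_id = var.vpc_id
  tags = {
    Type = "private"
  }
}

# Reuse an encapsulated group of resources.
module "kong" {
  source      = "./modules/kong"   # a hypothetical local module directory
  environment = var.environment
  subnet_ids  = data.aws_subnet_ids.private.ids
}
```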
And so we developed a module for provisioning our Kong clusters in AWS, and I’ve tried to make it a pretty low barrier to entry in terms of prerequisites. Obviously you’re deploying into AWS, so you’re going to need an AWS account. We do everything in VPCs, so you’ll have to have your VPC set up. Then you’ll have public and private subnets: ones that are exposed to the internet and ones that are completely internal. This is actually a best practice for AWS, so that you only expose things as needed. And we do labeling; we do a lot of tagging in our AWS accounts, and this way, again, we can use data sources to reference those without having a static reference to a subnet ID in there. So for this module, you just have to label your private and public subnets using the Type tag.
Then a default VPC security group: this is going to give you SSH access to the Kong nodes, and it allows you to choose how you want to do that, whether it’s, say, a corporate subnet or a bastion host. Then you’re going to need an SSH key pair for SSH into the Kong nodes, and an SSL certificate for HTTPS on the load balancers; this can actually be different for each one of the SSL endpoints. And then lastly, you just need Terraform. So hopefully a very low barrier to entry: you need your AWS account set up and Terraform.
And so I’m happy to announce that we’re releasing this as an open source project for everyone to share. It’s now available as kong-terraform under Zillow Group on GitHub, and we’ve added it to the Kong Hub that was released earlier today.
And so here’s an example of building your Kong cluster using our module. Pretty easy to do. And this is the point where it’s like, “Wabam! You got your Kong cluster,” and I’m selling it on TV like one of those “It’s now yours for three easy payments” ads. That last payment is going to be super complex, because don’t all of you CS majors feel robbed for never having used calculus throughout your entire career? So I made that last payment super complex so that you can apply some of that and feel like you got value from that CS degree. And so provisioning with Terraform is as easy as one, two, three: terraform init, plan, and apply. And again, wabam, you’ve got your Kong cluster. This is my Oprah moment: “You get a cluster, you get a cluster, he gets a cluster.”
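For illustration only, an invocation of the module might look roughly like the sketch below. The input names here are hypothetical, not the module’s actual interface; check the repository README for the real variables.

```hcl
# Hypothetical module invocation; input names are illustrative.
module "kong" {
  source = "github.com/zillow-group/kong-terraform"  # placeholder source path

  environment       = "dev"
  vpc_name          = "my-vpc"
  ec2_instance_type = "c5.2xlarge"
  ssl_cert_external = "arn:aws:acm:..."   # your certificate ARNs
  ssl_cert_internal = "arn:aws:acm:..."
}

# Then, from this directory:
#   terraform init    # download providers and modules
#   terraform plan    # preview the change set
#   terraform apply   # wabam: you've got your Kong cluster
```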
So while that’s actually applying and setting up all the resources, you have to go and do some things. If you go into the AWS console, into the Parameter Store, you’re going to want to add a password for your Kong cluster; basically, you can set it to whatever you want. I don’t have that in any of the Terraform .tf files because, again, we don’t want that being committed to a repository and then exposed to people who shouldn’t have it.
And the same thing with the license for EE, and the credentials for the Bintray auth; this is only if you’re doing the EE edition, and you can actually do both CE and EE with this module. The Kong nodes are running Minimal Ubuntu, and I’m actually a super huge fan of it. I came from the Debian world long ago, but Ubuntu is really a modern operating system that’s built for the cloud. Minimal Ubuntu has a very small footprint, I think under 100 megs. It’s very secure, because again we’re reducing the surface area and the scope of what’s being installed. It’s really fast booting, so I can provision an instance in 90 seconds from scratch. And it has a kernel optimized for AWS to give you even better performance.
The Kong service itself is installed for you. It’s supervised under a program called runit, also known by its command-line tool, sv. Basically, this will manage the Kong process for you, so if it somehow crashes or fails, it will restart it. We’ve also added a Kong Splunk plugin for logging, a great segue from the previous talk about Splunk logging; the Splunk plugin was released today.
Then automatic local log file rotation. For our declarative management of the API endpoints, we started out with Kongfig, and this was actually before the Terraform provider for Kong was released. And so that also gets installed.
One of the interesting hacks when I created this module was: how do I do ELB (Elastic Load Balancing) health checks? This actually became the first Kong endpoint. You have a /status on the Admin API, but for CE we didn’t expose our Admin API on any of our load balancers. So I thought, “I’ll just create a /status endpoint on the Kong gateway that points to the localhost Admin API’s status, and that way I can do health checks.”
The Enterprise version was just a modification of that, because, again, you can’t really do auth when you’re just hitting an HTTP endpoint for status. With Enterprise, RBAC is enabled by default, so what I did was create a monitor user with a token that has access to /status. Then we create the Kong /status endpoint and modify it using the request-transformer plugin to add the Kong admin token for the monitor user, which is just “monitor”.
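Put together, the health-check endpoint might be declared something like this; again, resource names follow the community Kong provider and are illustrative.

```hcl
# Sketch of the /status health-check hack: a gateway route that proxies to
# the local Admin API's /status, with the request-transformer injecting the
# monitor user's RBAC token on Enterprise. Names and tokens are illustrative.
resource "kong_service" "status" {
  name     = "status"
  protocol = "http"
  host     = "localhost"
  port     = 8001            # the local Admin API
  path     = "/status"
}

resource "kong_route" "status" {
  service_id = kong_service.status.id
  protocols  = ["http"]
  paths      = ["/status"]
}

# Enterprise only: RBAC is on by default, so inject the monitor token.
resource "kong_plugin" "status_token" {
  name     = "request-transformer"
  route_id = kong_route.status.id
  config_json = jsonencode({
    add = { headers = ["Kong-Admin-Token:monitor"] }
  })
}
```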
Once you’ve provisioned your cluster, there are some additional steps you want to take. By default, the root password is just KongChangeMeNow#1, so you’ll want to log into one of your instances and change it. It’s not too much of a security threat, because only the Kong instances themselves have access to that PostgreSQL database, but it’s definitely a good practice. You can then update that root password in the EC2 Parameter Store, so that as you provision new Kong nodes, they have an up-to-date configuration. Also, you’ll want to enable IP whitelisting on that /status endpoint, so that you’re not exposing it to the public on your external load balancer. And then for the Enterprise Edition, the default admin token in RBAC is just zg-kong-2-1. Obviously that’s going to rev with each version, but you can then log into the Admin GUI and change it.
I’m not sure if the Kong Terraform provider can do RBAC yet, but Kongfig can’t, so we just do that manually through the GUI. You’ll also want to update that value in the EC2 Parameter Store, because that admin token can be used for your declarative definitions of Kong endpoints. Some additional features of this module: almost all the settings are tweakable. You can change your EC2 instance sizes, your timeouts, your thresholds. Resources can also be optionally provisioned: if you’re not using Redis, you don’t have to enable Redis, and if you have an existing PostgreSQL database that you want to use, you don’t have to provision Aurora.
Then a big thing for us is CloudWatch. There are CloudWatch actions that you can define that will trigger on the various thresholds, which can be tweaked.
This allows you to send an email or page you through PagerDuty if a Kong node goes down or you’re hitting 4xx or 5xx thresholds. You can also add bastion host access to all the resources, since everything is locked down. Say you want to manage the PostgreSQL database outside of that; you can add that to the bastion host CIDR blocks.
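One illustrative alarm, sketched in Terraform: notify an SNS topic (which you might wire to email or PagerDuty) when the load balancer returns too many 5xx responses. Thresholds, names, and the LoadBalancer dimension value are all illustrative.

```hcl
variable "lb_arn_suffix" {}   # the load balancer's ARN suffix, per your account

resource "aws_sns_topic" "kong_alerts" {
  name = "kong-alerts"
}

# Fire when the external LB sees sustained 5xx responses.
resource "aws_cloudwatch_metric_alarm" "kong_5xx" {
  alarm_name          = "kong-dev-elb-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_ELB_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 100
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = var.lb_arn_suffix
  }

  alarm_actions = [aws_sns_topic.kong_alerts.arn]
}
```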
Some recommendations for running in production: use the c5.2xlarge instances; this is actually Kong’s recommendation as well. And if you look at it, you’re going to see your host running maybe 3 to 5% CPU and like 800 megs of RAM out of the 16 gigs I think it has, and you’re like, “Why am I doing this?” It’s for the networking. With AWS instance sizes, you have to evaluate not only the CPU and memory that you need to provision for; the network is very important.
If you look at those T2 instances, you’re going to get burstable network speeds that are sporadic and not very kind to production workloads. The c5.2xlarge gets you into that 10-gigabit range, which will scale for production. Obviously, you’re going to want to look at those CloudWatch logs and then implement auto scaling policies for when you shrink down and when you go big for your production workloads. The module also allows you to add additional tags to all of the AWS resources. When they get provisioned, they’re going to have a service and description tag; when you provision the service, its name is going to be zg-kong-2-1- plus whatever environment you define, so dev, staging, prod. But you can also go in and add additional tags in the module itself, which will be passed to every single resource that gets created, and this will help you with your auditing and billing.
Then I highly recommend you send those Kong logs to a remote endpoint. I mentioned Splunk, and that’s what we use. It really enables you and your developers to have visibility into the health of the cluster and the health of the application. Our teams heavily rely on this to monitor and alert on the health of their applications, and we use it to do the same for the clusters. One really cool thing: when we released the Splunk plugin at Zillow, everyone immediately went in and started looking at latencies, and they were just blown away. I’m really impressed with how performant Kong really is. To see an average latency of zero milliseconds on processing, and a really low p99 (I think it was like 4), was impressive, and it just boosted everyone’s confidence in Kong.
Then lastly, we talked about API key management, because again we’re putting this in declarative form where you don’t know what that key is. We actually use pwgen, a little command-line utility for Linux and Mac, to generate secure passwords: pwgen -s 32 1. The -s makes the password more secure, you then specify the length (for API keys we use 32), and the 1 just says give me one password. This way, as you go into the declarative world for your API endpoints, when someone gives you, say, a pull request to create a new endpoint and there’s an API key that’s supposed to be in there, they can actually just provide you a token name. You generate a value for that token using pwgen, store it in something like Vault or the EC2 Parameter Store, and have it automatically inserted into the rendered config as it’s applied.
Some thoughts on API management: we’re using Kongfig currently. Again, that was because the Terraform provider didn’t exist at the time, but it does have some limitations. It doesn’t support services and routes in the newer versions of Kong, and its development has really slowed down. So we’ve been looking at some alternatives. Terraform, I think, is going to be the go-to in the future, because it also puts everything into the same declarative language that we use to provision Kong itself. And then there’s also Ansible.
Then you can set up policies to make this completely self-service, so that your customers can send you a pull request for the endpoint they want to create or modify, and as you merge to master, it’s automatically applied to the appropriate cluster. Some ideas that I’ve had for the future: this is going to raise the barrier to entry for using our Kong module a bit, but we’re looking at maybe building a custom AMI using HashiCorp’s Packer. This would basically allow us to configure Kong and a lot of the static content before it even boots, so that we’re only applying database passwords and some of the dynamic configuration as it comes up.
StatsD integration, so that we can get more metrics on the Kong nodes themselves, and additional CloudWatch triggers, so that we can alarm on CPU, memory, and disk. Then also providing some example auto scaling policies; we’re still in the process of setting that up ourselves. Right now we provision however many Kong nodes we think we need for a given environment and just keep it statically there. The auto scaling policies replace nodes as needed, say as AWS changes hardware underneath us, but we really haven’t had a need to scale out yet, because Kong is so performant that we just haven’t needed to go beyond the minimum amount we feel we need for reliability.
Then we’re also evaluating the PerimeterX plugin. PerimeterX is an enhanced bot protection framework. We use it on our primary website, and being able to offer that within our Kong service as well would be a great complement to it. And then, instead of having all of our different Kong clusters talk to each other over the internet, we’re thinking about setting up, say, a transit Kong VPC where there’s only Kong in it. We don’t mind peering that with all of our different brands, because at some level we do trust them, and that way we’re getting better performance at the VPC level.
And then, with the exciting announcement of 1.0, we’re definitely looking into the service mesh opportunities there. Even before the sidecar proxy was released, I was thinking: what if we just got rid of all of our load balancers in front of our services and just had Kong? Then you would have all of those great features available to you and basically have a service mesh within your data center, without refactoring all these things or adding sidecar capabilities. Even though it’s out in the network and not in the sidecar itself, all those features still exist and are available to you immediately, before, say, migrating to Kubernetes or changing out your entire stack.
So big thanks to some people at Zillow Group: Toby Roberts, the VP of Operations, and Leif Halverson, the Director of Infrastructure, who’s my supervisor; they were great supporters of our transition to Kong Enterprise and had no problem justifying that expense. Jan Grant, who’s our PM for the project, kept me on task with all of those Jira tickets. And then my own teams; this isn’t just production operations at Zillow, because I’m in Seattle, but also the other brands. I’ve got some guys sitting up here in the front row who were very prominent in the implementation of all of this.
And all the product teams we partnered with; it was exciting to do this with a bunch of teams that were eager to onboard, and it was just a really fun process. And then all the people at Kong headquarters: you really struggle to find people in this industry who are so bright, pleasant, and fun to work with. Danny, Aaron, Harry on the cloud team, Ben Helves also, our customer success engineer, Travis, I could go on. And then all of you Kongers, because I think Kong is such an awesome technology, but then to back it up with such a vibrant and cool community, my hat’s off to all you guys.
Thank you.
And so today, the agenda is we’re going to give a brief introduction to Kong and really about what was attractive to us about it. We’ll talk about Kang at Zillow Group, kind of the evolution there, and we’ll go over our architecture that we use in AWS and introduce infrastructure as code with Terraform for how do we deploy our clusters, then close with some thank you’s and some time for Q&A.
So what is Kong? Everyone knows, hopefully by now throughout the Kong Summit that it’s an API gateway and it really is just a proxy between clients and your API and so if you think about going to the bar with your friend, it’s your local bar. He’s coming in with you. It’s like “Oh, Let’s order some Manhattan’s. I’m like, “No, wait, I got this bro,” because you know the bartender. So the bartender is our micro service on the back end. You’re the client wanting to request something. I give him the wink. He comes over, bypasses the beautiful women that are also waiting in line for a drink. I’m like, “I need two Manhattan’s, straight up with a twist.” So you could have ordered that yourself, but you may have been waiting a little bit longer. You may have not gotten the response that you wanted from the bartender and so that guy in the middle facilitated the request, gave us some quality of service
The beauty of Kong is with it’s extensible API. We can add a lot of functionality there as like utility features into the microservice architecture. You look at the server itself, it’s built on Nginx OpenResty and then the Kong server itself. We’ll go into a little bit of detail here about what that is. So Nginx is an extremely strongly powerful web server, very high performance, powers over 400 million websites, and so if you look at that as an open source project itself and the community behind it, it’s very attractive. OpenResty, which integrates Nginex with the Lua Just-In-Time Compiler, basically provides you an easy to use and scalable dynamic web gateway and that’s what Kong uses to build itself on top of it. And then with that Kong has its own plugin architecture to where you can also extend the functionality of it, and so it’s highly extensible, scalable, and restful, making it a great pattern for infrastructure as code and also platform agnostic, which is a great benefit for us given the different types of architectures that we use.
So Kong came into the picture at Zillow Group when we were looking at sharing APIs between our different brands and so Zillow Group is actually composed of brands like Zillow, Trulia, Hot Pads, StreetEasy, you can see them down there at the bottom of the slides. Anyways. We have these development groups wanting to come in and share their APIs and they’re like already amped up and ready to go and it’s like, “Oh, let’s just set up a VPN tunnel between our two data centers and then we’ll start sharing that API.” Then the next group comes along like, “VPC peering.” like … “Oh, we already set up this as a public API.”
You can see the headaches already starting to form with the operation teams. It’s like, “Okay, let’s pump the brakes here for a second.” Those are obviously old and busted ways. They’re not going to be a consistent pattern. It’s not going to be scalable for the future, not secure. So we came up with some tenants for what I call the new hotness. We wanted to build a service that could be consumed by all of our different brands, that way when we’d look at all the different architectures and data centers that we had, we needed something that would work in each one of those.
We wanted that to be consistent and secure for the microservices as well. And we were looking for something that was standards based, and also quick and easy onboarding, and I think that really translated into a story about this needed to be completely transparent to our development teams because we wouldn’t want them to go back in and have to refactor a ton of stuff in order for Kong to work. That was again, one of the big attractions to Kong is that, we could abstract a lot of that stuff into the Kong, unify a lot of the functionality into one spot and then not have to be dealing with, “Oh, we found a security bug in this utility microservice communication package that we’ve built.” And then trying to get teams to say upgrade a that in a consistent way would be a nightmare.
So this is where Kong came aboard, worked with teams down at Trulia, we came up with this architecture for sharing our microservices using Kong. At Zillow, we have these things called brain dumps. They happen every Tuesday where you’re introducing new concepts and new services to the company. And so, I presented on Kong on Tuesday, August 15th, 2017, little over a year ago. And then all of a sudden that’s when Kong blew up. Week 2, I had meetings booked out for weeks in advance, basically taking up three quarters of my time. I had 40 Jira tickets and two weeks of people requesting Kong. And so again, it was a pretty overwhelming, thank God for PMs. Right?
And so, as Kong was hitting the water cooler talk, it was starting to gain a lot of momentum. And so it’s like, “Okay, we’re sharing APIs between these brands and we hear about all these other cool features of Kong.” And so it was like, “Can I?” Can we do public APIs? Well, yes. Yes you can. We want to do some cores with that as well. Yes. Yes, you can. Rate limiting. Yes you can. And so then all of a sudden playing in the back of my mind was that song by Tribe Called Quest, Can I Kick It? Yes, you can. East West authentication. Yes you can. And now I’m feeling like the Kong guru at Zillow. It’s like, “Can I?” Yes, you can. I just wanted to have the tape on a big old boom box. Just ready to go for any time someone came up to me and was can I do this with Kong? And so like the last one was lambda. Yes you can. So can you kick it? Can you Kong it? You absolutely can.
So I’m going to get into a couple of specific use cases that I had introduced in that last side. One was our east to west authentication. When we think of Kong API gateway, a lot of that is north and south, and that is basically data coming in and out of your data center. East, west is that traffic within it. So we had a specific service, it’s an email subscription based service that manages a large number of campaigns and people subscriptions to those campaigns, and they were definitely concerned about the pattern of, oh, hey, I need access to this, I’m going to go look at this other service, copy and paste code from it, and then my stuff is up and running and then all of a sudden you have these inherent consumers that you’ve onboarded and not known about it. Because email can be such a tricky and very spammy thing, they didn’t want anyone just having access to it that and they really wanted to control access within that service to specific endpoint.
So if I’m creating a campaign for my specific microservice, you’re going to be limited to the scope of that particular campaign. And so they came to us with this potential opportunity. We decided that we’ll create an API endpoint, for each service route. We then had a one to one relationship for each API endpoint with a white list group. And then for each of the microservice consumers that we onboarded, we created a consumer for them, and then added them to each of the groups for the end points that they needed access to. And then, for the API keys we use our own version of Vault for escrowing those values. So the service owners themselves don’t even have access to them. Those get substantiated on deploy of the application.
And then came along caching. We had an old service that was a struggling with the current load of our website. It was a service that was hit for every home detail page. So basically when you go to Zillow.com and you’re looking at a specific property, there were property attributes that were being loaded from this service that we just couldn’t keep up with the current load, it was an older one that was tied to SQL server and they were thinking about browning out the service until they could build the replacement using DynamoDB. Then they came to us, it’s like, “Let’s do some caching.” And initially it was Squid and Varnish and our ops team was like, “We don’t want to get into maintaining this” because when the development teams come to you and say six months, yeah, it’s going to be done in a year.
And this was at the time we were starting to evaluate Kong Enterprise because we are really starting to ramp up our workloads to enterprise levels and it was becoming a core services are. And so not only from the support perspective, but looking at this caching plugin, we went into an evaluation of Kong enterprise and found that this was going to be a great solution for us because then we’re not introducing new technology that we had to maintain. We were already had that Kong infrastructure. We’d already built that field of dreams. And so onboarding this was very easy. That, And we looked at the complications of caching with other solutions. Having Redis a backend where we’re warming the cache for every single node at the same time. And not having to do that on an individual instance basis was a really powerful advantage for us.
And so this is when Kong enterprise into production, we had looked at what we do the amount of data that we wanted to cache in order for the service to be healthy. We ended up sizing our backend Redis appropriately. We were getting about 70% hit rate on the cache and it brought down our average latency from 25 milliseconds to 4. And we were really impressed with that, having not implemented Kong and for a caching solution at all. So it was, again, really impressive for us.
And so looking back at a lot of our factors of our success, obviously Kong played a big part of that, but then we have the Zillow Group core values that we move fast and we think big in that ZG is a team sport and that we own it.
And so it was really great to see a lot of the different brands come together, embrace this idea, collaborate, build this solution with me. Along with that, it was very complimentary, a lot of the devops principles that we partnered with our customers, our development teams for success that we automate, automate, automate as much as we can, we make things self service, and we do things in a way that allows us to iterate quickly.
And then again, the power and flexibility of Kong just really opened up a lot of doors for us. And then Kong just being there caused people to really think differently about how we were doing things. And then lastly, there were a lot of features of AWS that we took advantage of in order to scale out to enterprise workloads. And so with AWS we ended up leveraging a lot of their best practices. So in terms of high availability, in each region they offer multiple availability zones, and these are basically separate data centers within a geographic region that give you redundant everything at every level. And it’s very important to leverage multiple AZs because if you’ve ever used AWS, some of those go down sometimes.
Then the ability to elastically scale and I think a lot of people when they think of scaling, it’s just upward and onward where it’s like I’m only ever going to be adding more. And I think one of the important tenants in the cloud is that if you really want to see that AWS savings at the end of the month, you also need to be able to scale down and it’s really important practice to implement. Otherwise, at the end of the month, it’s like, “Why am I spending a million dollars on this? I thought this was going to be cheaper?” It’s like, “No, you need to scale both ways” and it’s just like those sweatpants that you put on here. Here, I’ve been eating well at this conference all week, drinking free drinks. I’m going to expand those sweatpants out, but when I get home and start working out again, they need to still fit and not fall off my ass when I get home.
So scaling up and down, and also scaling horizontally and vertically. So when you’re looking at the database instances, the EC2 instances that you’re using, you want to be able to increase the size of those instances and then you also want to be able to add more instances to scale in both directions and then AWS has a lot of tools out there that help us with the automation process. Then security, even though it’s the last slide, or last item on the slide, it should never be the last thought. It should always be an integral part of anything that you do and also realizing that in AWS it’s a shared responsibility with you and AWS, that you should be leveraging a least privileged model that you only introduce permissions and access as they’re needed and then using security as code will allow you to make sure that your policies are enforced.
And so we were looking at our AWS resources for Kong. We went with PostgreSQL. Well, we just have in house experience with it. We’re very comfortable with it. We didn’t have any Kasandra before, and for the way that we wanted to manage and scale Kong, it was the right fit, and we went with Aurora because again, of the enterprise aspects. You look at RDS versus Aurora, you’re getting the multi-AZ clustered managed auto patched, automated backups, a lot of enterprise features that will help you withstand a disaster.
And then because we were using rate limiting and also the caching we wanted ElastiCache Redis back in, we used elastic cache for that, again, a managed service that scales out has and uses the write and replica technologies and then EC2 Auto Scaling as we added new nodes to the Kong cluster or there was say a hardware failure, in AWS we wanted to be able to replace those nodes or add new nodes as needed.
I think of one really cool thing is that’s often overlooked in AWS is the EC2 parameter store. It’s a great key value, secure string service that you can use to protect your data. And we actually use it for our database passwords, API keys, a lot of the sensitive information that we don’t want sitting out there in a repository or in our Terraform state files.
Again, another important pieces of plastic load balancing have that in multiple AZs to protect your services and scale out. Then CloudWatch, you need to be able to monitor and alert on the health of your services. We used, IAM for our instance profiles on the Kong nodes, so that they can then reach out and get access to various things like to set EC2 parameter store. So we’re not embedding keys in any of the nodes.
And then again, with security groups, least privileged model, everything that we did in AWS was really locked down to the specific things that needed access to it. So our load balancers can talk to a Kong node. Nothing else can talk to a Kong node, you can talk to the load balancer, Kong nodes can talk to the database, and everything is very secure and locked down.
And so this is what our architecture looked like and still looks like. So right in the middle, you’ll see a Kong cluster that has, again, the auto scaling group where the nodes in the cluster can scale out and scale back in as needed. We have an external load balancer to accept connections from the public internet and from other Kang clusters in different data centers. And we only expose that via SSL.
Internally, we have a load balancer that can do both HTTP and HTTPS depending on the service. And then for the Admin Gooey functionality and the enterprise edition, you can also access that via SSL and then the Admin API as well. And so the way we designed this was that we wanted to be completely transparent for the microservices. And so our consumers actually hit a Kong end point in their local VPC that adds the API key for them, forwards that to the remote Kong cluster, it validates the API key as that consumer and then forwards it onto the microservice. So that way, we in operations can seamlessly do key rotation without having to impact a service deploy or have developers reconfigure things.
So in terms of provisioning resources, infrastructure as code is the new buzzword, and it’s a great buzzword in terms of what it actually implies, and so what is it? It’s just machine readable configuration of your data center and this can be a script or a declarative definition of what you want your infrastructure to look like and how it should behave and the benefits to doing it this way is that you can then version it, you can put it in a repository and iterate on that and be able to revert back to a previous state as needed. It’s also shareable and that was very valuable to us at Zillow because we have multiple brands, multiple devops teams, and multiple people provisioning and it’s reusable, In that way, we didn’t have to reinvent the wheel at each step. We were always using the same code together, and then repeatable because even within a devops groups, we were deploying multiple Kong clusters and so we needed a way to ramp up quickly to meet the demands of our customers.
And so at Zillow, Terraform was our way of doing that. There are obviously other tools out there for AWS, but we had already standardized on Terraform and so that’s how we did Kong. And Terraform is just a tool that codifies a APIs into declarative configurations. And you can go to the website there are some of the additional benefits to Terraform is that it is open source just like Kong and has a great community behind it. Another benefit is that it can really manage your complex change sets. And so if you’re looking at introducing say a new resource or modifying an existing resource, Terraform can go look at what’s already existing in your VPC. Compare that to the changes you want to apply and only make those changes. It can also manage resource dependencies. So if you have a security group that depends on another security group, or if you have a EC2 instance that’s going to depend on a database, Terraform can help you easily manage those dependencies to make sure that resources are created, modified in a destroyed in the appropriate order.
So getting started with Terraform, you can just go down to their website, download it for free, it’s available for a number of different platforms, and you can use homebrew on the macOS to get started with that as well. You just unzip the binary, place it into a folder that’s in your path. And here’s some instructions to help you do that. Super easy to do. Then you just verify your installation. Here, I’m on my system called Awesome. I have another one called Booya, Terraform –version, and it will give you what you have currently installed. It’s actually important to take notice of this because Terraform is actually a project that iterates really quickly. They release new versions all the time and it’s important to stay up to date, you actually may find when you start collaborating with other people, if they’ve downloaded version after you and have something that’s more up to date, you’re going to have to upgrade in order to modify the state.
Getting further into Terraform, there are a number of different providers that you can use to describe, provision, change, and destroy resources in your environment. There are actually over 70 officially supported providers, AWS being one of them; it's been brought up in other talks as well. There are community providers too, Kong being one of those. So it's a very flexible and powerful tool.
So Terraform configuration is basically just text files. There's the Terraform format with a .tf extension, and this is the preferred way to describe your infrastructure because it's human readable and you can add comments to it. The declarative format for that is HCL, HashiCorp Configuration Language. You can also do it with JSON, using a .tf.json extension, but that is really designed for applications that generate the Terraform for you to apply. Again, recognize the limitations of JSON: it's not as human readable, and you're not going to have comments.
So the configuration semantics for Terraform: it basically looks at all the files in your directory, orders them alphabetically, and merges them all together. If you create a definition of the same resource in multiple files, you'll actually get an error on that merge because of the duplicate declarations. There is a pattern for overriding, which I'll leave here for further reading on your part; I'm not going to go into details, just so we can focus on Kong. And because Terraform is declarative, the order in which variables are referenced within the files doesn't matter; it merges everything for you. Some basics: typically you'll lay out a directory with a main.tf, and this is where you specify your provider. This could be the AWS one, where you give it some credentials, or it could be the Kong one, where you give it your Admin API token.
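A minimal main.tf sketch along those lines; the region is a placeholder, and the Kong provider attribute name is an assumption based on the community provider's documentation:

```hcl
# main.tf: provider configuration lives here by convention.
provider "aws" {
  region = "us-west-2" # placeholder region
}

# The community Kong provider points at the Admin API; the admin token
# is better supplied via an environment variable than committed here.
provider "kong" {
  kong_admin_uri = "https://kong-admin.example.com:8444"
}
```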
Then you define resources, in basically any file name that you want; typically it would be redis.tf or aurora.tf, named for the resource you're trying to provision. A resource is just a component of infrastructure, and you can actually have multiple definitions within one file. A data source, typically stored in data.tf, references an existing piece of infrastructure that wasn't created within your Terraform directory. It may have been created by someone else in their Terraform directory, but instead of statically referencing, say, a resource ID like an ARN, you can use the data source to pull that information in for you. That way, if you were to rebuild a VPC and get a new ARN, you're not having to update static references in each place.
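For example, here's a hedged sketch of looking up an existing ACM certificate by domain instead of hard-coding its ARN (the domain is a placeholder):

```hcl
# data.tf: look up infrastructure created outside this directory.
data "aws_acm_certificate" "kong" {
  domain   = "kong.example.com"
  statuses = ["ISSUED"]
}

# Elsewhere, reference data.aws_acm_certificate.kong.arn instead of a
# pasted ARN, so a reissued certificate doesn't break this configuration.
```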
And then variables are basically just parameters that you can specify; think of any other programming language, and even though HCL is not a programming language, it's the same principle. Lastly, modules: these are basically a Terraform directory of resources encapsulated into one group so they can be reused.
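A quick sketch of a variable (the name is illustrative; we'll get to a module example with our Kong module in a moment):

```hcl
# variables.tf: a parameter with a default, overridable per environment.
variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  default     = "dev"
}

# Referenced elsewhere as var.environment, e.g. when naming resources.
```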
And so we developed a module for provisioning our Kong clusters in AWS, and I've tried to keep the barrier to entry pretty low in terms of prerequisites. Obviously, you're deploying into AWS, so you're going to need an AWS account. We do everything in VPCs, so you'll have to have your VPC set up, with public and private subnets: ones that are exposed to the Internet and ones that are completely internal. This is actually a best practice for AWS, so that you only expose things as needed. And we do a lot of tagging in our AWS accounts; this way, again, we can use data sources to reference resources without having a static reference to, say, a subnet ID. For this module, you just have to label your private and public subnets using the type tag.
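That tag convention lets the module discover subnets with a data source, roughly like this. The lookup is a standard AWS provider data source; the tag values reflect the labeling convention described above:

```hcl
# Discover the private subnets in the target VPC by their "type" tag.
data "aws_subnet_ids" "private" {
  vpc_id = var.vpc_id

  tags = {
    type = "private"
  }
}

# Internal resources can then be spread across
# data.aws_subnet_ids.private.ids without hard-coded subnet IDs.
```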
Then a default VPC security group: this is for giving you SSH access to the Kong nodes, and it lets you choose how you want to do that, whether from, say, a corporate subnet or a bastion host. You're also going to need an SSH key pair for getting into the Kong nodes, and an SSL certificate for HTTPS on the load balancers; this can actually be different for each of the SSL endpoints. And lastly, you just need Terraform. So hopefully a very low barrier to entry: your AWS account set up, and Terraform.
And so I'm happy to announce that we're releasing this as an open source project for everyone to share. It's now ZillowGroup/kong-terraform on GitHub, and we've added it to the Kong Hub that was released earlier today.
And so here’s an example of building your Kong cluster using our module. Pretty easy to do. And so this is the point where it’s like, “Wabam. You got your Kong cluster, I’m selling it on TV.” Like one of those, yeah, “It’s now yours for three easy payments.” And that last one is going to be super complex because don’t all have UCS majors feel robbed for never having using calculus throughout your entire career? So I made that last payment super complex so that you can apply some of that, feel like you get value for that CS degree. And so provisioning with Terraform easy as one, two, three, Terraform init, plan, and apply. And again, wabam, you’ve got your Kong cluster, and this is my Oprah moment. It’s like, “You get a cluster, you get a cluster, he gets a cluster.”
So while that’s actually applying, and setting up all the resources, you actually have to go and do some things. And so if you go into AWS console, into the parameter store, you’re going to want to add a password for your Kong cluster. And basically, you can set it to whatever you want. I don’t have that in any of the Terraform tf files because again, we don’t want that being committed to a repository and then being exposed to people that shouldn’t have it.
And the same thing with the license for EE, or the credentials for the Bintray auth; this only applies if you're doing the EE edition, and you can actually do both CE and EE with this module. The Kong nodes run Minimal Ubuntu, and I'm actually a super huge fan of it. I came from the Debian world long ago, but Ubuntu is really a modern operating system built for the cloud. Minimal Ubuntu has a very small footprint, I think under 100 megs. It's very secure because, again, we're reducing the surface area and the scope of what's being installed. It's really fast booting, so I can provision an instance in 90 seconds from scratch. And it has a kernel optimized for AWS to give you even better performance.
The Kong service itself is installed for you. It's supervised under a program called runit, which you control with its command-line tool, sv. Basically, this will manage the Kong process for you, so if it somehow crashes or fails, it gets restarted. We've also added a Kong Splunk plugin for logging; a great segue from the previous talk about Splunk logging, that plugin was also released today.
Then there's automatic local log file rotation. For our declarative management of the API endpoints, we started out with Kongfig, and this was actually before the Terraform provider for Kong was released; that also gets installed.
One of the more interesting hacks when I created this module was how to do ELB (Elastic Load Balancing) health checks, and this actually became the first Kong endpoint. There's a /status on the Admin API, but for CE we didn't expose our Admin API on any of our load balancers. So I thought, "I'll just create a /status on the Kong gateway that points to the localhost Admin API's /status, and that way I can do health checks."
The enterprise version was just a modification of that, because you can't really do auth when you're just exposing a plain HTTP endpoint for status. With Enterprise, RBAC is enabled by default, so what I did was create a monitor user with a token that has access to /status. Then we create the Kong /status endpoint and modify it using the request-transformer plugin to add the Kong admin token for the monitor user, which is just "monitor".
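Roughly, the whole trick looks like this in Terraform. Again, the resource and attribute names are assumptions based on the community Kong provider; the plugin itself and the Kong-Admin-Token header are standard Kong pieces:

```hcl
# A proxy-facing /status service pointing at the local Admin API.
resource "kong_service" "status" {
  name     = "status"
  protocol = "http"
  host     = "localhost"
  port     = 8001 # Admin API port
  path     = "/status"
}

resource "kong_route" "status" {
  service_id = kong_service.status.id
  protocols  = ["http", "https"]
  paths      = ["/status"]
}

# EE only: inject the monitor user's RBAC token on the way through, so
# the load balancer's health check can reach the protected endpoint.
resource "kong_plugin" "status_token" {
  service_id  = kong_service.status.id
  name        = "request-transformer"
  config_json = <<EOF
{
  "add": {
    "headers": ["Kong-Admin-Token:monitor"]
  }
}
EOF
}
```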
And so once you’ve provisioned your cluster, there are some additional steps you want to take, by default, the root password is just KongChangeMeNow#1. And so you’ll want to log into one of your instances and change it. It’s not too much of a security threat because only the Kong instances themselves have access to that PostgreSQL database, but it’s definitely a good practice. You can then update that root password in the EC2 parameters store. So as you provision new Kong nodes, it has up-to-date a configuration. Also, you’ll want to enable IP white listing on that slash status endpoint so that way you’re not exposing it to the public on your external load balancer. And then for enterprise edition, the default admin user in our back is just zg-kong-2-1. And obviously that’s going to rev with each version, but you’ll then can log into the Admin Gooey and change that.
I’m not sure if the Kong Terraform provider can do our back yet, but Kongfig can’t, and so, we just do that manually through the Gooey. You’ll also then want to update that value in the EC2 parameters store because that admin token can be used for your declarative definitions of Kong endpoints. Some additional features about this plugin is that almost all the settings are tweakable. You can change your EC2 instant sizes, your timeouts, your thresholds, and also resources can be optionally provision. So if you’re not using Redis, you don’t have to enable Redis. Say you have an existing PostgreSQL database that you want to use, you don’t have to provision Aurora.
Then a big thing for us is CloudWatch. There are CloudWatch alarm actions you can define that will trigger on the various thresholds, which can be tweaked.
This allows you to send an email or a PagerDuty alert if a Kong node goes down or you're hitting your 4xx or 5xx thresholds. You can also add bastion host access to all the resources, since everything is locked down; say you want to manage the PostgreSQL database outside of Kong, you can add your bastion host's CIDR blocks.
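A hedged sketch of what one of those alarms can look like, using the standard AWS provider. The load balancer dimension, SNS topic, and thresholds are all placeholders to tune, not values from our module:

```hcl
# Page on a burst of 5xx responses from the external load balancer.
resource "aws_cloudwatch_metric_alarm" "kong_5xx" {
  alarm_name          = "kong-external-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 100
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = "app/kong-external/0123456789abcdef" # placeholder
  }

  # Placeholder topic; wire this to email, PagerDuty, etc.
  alarm_actions = ["arn:aws:sns:us-west-2:123456789012:pagerduty"]
}
```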
Some recommendations for running in production: use c5.2xlarge instances; this is actually Kong's recommendation as well. If you look at it, you're going to see your host running maybe 3 to 5% CPU and around 800 megs of RAM out of the 16 gigs I think it has, and you're like, "Why am I doing this?" It's for the networking. With AWS instance sizes, you have to evaluate not only the CPU and memory you need to provision for; the network is very important.
If you look at the T2 instances, you're going to get burstable network speeds that are sporadic, and they're not very kind to production workloads. The c5.2xlarge gets you into that 10-gigabit range, which will scale for production. Obviously, you're going to want to watch those CloudWatch logs and then implement auto scaling policies for when you shrink down and when you go big for your production workloads. The module also allows you to add additional tags to all of the AWS resources. When they get provisioned, they're going to have a service and a description tag, and when you provision the service, the name is going to be zg-kong-2-1- plus whatever environment you define: dev, staging, prod. But you can also go in and add additional tags in the module itself, which will be passed to every single resource that gets created, and this will help you with your auditing and billing.
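One possible shape for such a scaling policy, as a hedged sketch (as noted later, we haven't settled on ours yet; the group name and target value are placeholders):

```hcl
# Target-tracking policy: scale the Kong ASG to hold average CPU near 50%.
resource "aws_autoscaling_policy" "kong_cpu" {
  name                   = "kong-cpu-target"
  autoscaling_group_name = "zg-kong-2-1-dev" # placeholder ASG name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50
  }
}
```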
Then I highly recommend you send those Kong logs to a remote endpoint. I mentioned Splunk, and that's what we use. It really enables you and your developers to have visibility into the health of the cluster and the health of the application; our teams heavily rely on this to monitor and alert on the health of their applications, and we use it to do the same for the clusters. One really cool thing: when we released the Splunk plugin at Zillow, everyone immediately went in and started looking at latencies, and they were just blown away. I'm really impressed with how performant Kong is; seeing an average latency of zero milliseconds of processing and a really low p99 (I think it was around 4 milliseconds) was impressive, and it just boosted everyone's confidence in Kong.
Then lastly, we talked about API key management. Because we're putting this in declarative form, where you don't want the key itself visible, we actually use pwgen, a little command-line utility for Linux and Mac, to generate secure passwords: `pwgen -s 32 1`. The -s makes it fully random and more secure, 32 is the length we use for API keys, and the 1 just says give me one password. This way, as you go into the declarative world for your API endpoints, when someone sends you a pull request to create a new endpoint with an API key that's supposed to be in there, they can just provide a token name. You generate a value for that token using pwgen, store it in something like Vault or the EC2 Parameter Store, and have it automatically inserted into the rendered config as it's applied.
Some thoughts on API management: we're using Kongfig currently, again because the Terraform provider didn't exist at the time, but it does have some limitations. It doesn't support services and routes in the newer versions of Kong, and its development has really slowed down, so we've been looking at alternatives. I think Terraform is going to be the go-to in the future, because it also puts the endpoints into the same declarative language we use to provision Kong itself. And then there's also Ansible.
Then you can set up policies to make this completely self-service, so that your customers can send you a pull request for the endpoint they want to create or modify, and as you merge to master, it's automatically applied to the appropriate cluster. Some ideas I've had for the future: this would raise the barrier to entry for our Kong module a bit, but we're looking at maybe using a custom AMI built with HashiCorp's Packer. This would basically allow us to bake in Kong and a lot of the static configuration before the instance even boots, so that we're only applying database passwords and some of the dynamic configuration as it comes up.
Also StatsD integration, so that we can get more metrics on the Kong nodes themselves, and additional CloudWatch triggers so that we can alarm on CPU, memory, and disk. Then we want to provide some example auto scaling policies; we're still in the process of setting that up ourselves. Right now we provision however many Kong nodes we think we need for a given environment and just keep it statically there. The auto scaling group replaces nodes as needed, say as AWS changes hardware underneath us, but we really haven't had a need to scale out yet, because Kong is so performant that we just haven't needed to go beyond the minimum count we feel we need for reliability.
Then we’re also evaluating the PerimeterX plugin, PerimeterX is an enhanced bot protection framework. We use it on our primary website and then to be able to offer that within our Kong service as well would be a great compliment to it. And then instead of having all of our different con clusters talk to each other over the internet, we’re thinking about setting up, say a transit Kong VPC where there’s only Kong in it and we don’t mind peering with all of our different brands because at some level we do trust them and that way we’re getting better performance at the VPC level.
And with the exciting announcement of 1.0, we're definitely looking into the service mesh opportunities there. But even before the sidecar proxy was released, I was thinking: what if we just got rid of all of the load balancers in front of our services and just had Kong? Because then you would have all of those great features available to you and basically have a service mesh within your data center, without refactoring everything to add sidecar capabilities. Even though Kong sits out in the network and not in the sidecar itself, all those features still exist and are available to you immediately, before, say, migrating to Kubernetes or changing out your entire stack.
So big thanks to some people at Zillow Group: Toby Roberts, the VP of Operations, and Leif Halverson, the Director of Infrastructure and my supervisor, who were great supporters of our transition to Kong Enterprise and had no problem justifying that expense. Jan Grant, our PM for the project, who kept me on task with all of those Jira tickets. And then my own teams, and that isn't just production operations at Zillow, where I am in Seattle, but also the other brands; I've got some guys sitting up here in the front row who were very prominent in the implementation of all of this.
And all the product teams we partner with: it was exciting to do this with a bunch of teams that were eager to onboard, and it was just a really fun process. And then all the people at Kong headquarters; you really struggle to find people in this industry who are so bright, pleasant, and fun to work with: Danny, Aaron, and Harry on the cloud team, Ben Helves, our customer success engineer Travis, I could go on. And then all of you Kongers: I think Kong is such an awesome technology, but to back it up with such a vibrant and cool community, my hat's off to all you guys.
Thank you.