Deps is a private, hosted Maven repository. While building it I had a critical decision to make: where to host it. I evaluated many different hosting services but decided on Google Cloud Platform (GCP). I’ve been using it in pre-production and then production for over a year and a half now. I haven’t seen too many experience reports on Google Cloud, so I wanted to share how I’ve found it, what went well, and what still needs improving. I’ve split my thoughts below into good, meh, bad, ugly, and opportunities for improvement. I have compared and contrasted with Amazon Web Services (AWS), the other hosting provider that I have the most experience with, and GCP’s biggest competitor.
A note up front: these are solely my experiences, and it’s quite possible that I’ve misunderstood or misrepresented things here. If I’ve made any mistakes, let me know so I can correct them. I only talk about services that I have experience using. There are a bunch of really good looking services like Google Kubernetes Engine, Google App Engine, and BigQuery, but I haven’t used them enough (or at all) to be able to give a review on them.
Google Cloud’s permission model is one of its strongest features. AWS IAM permissions are fairly complicated; in practice people often set permissions to s3:* or equivalent to make things work. For this reason (and others), it has become a ‘best practice’ to run multiple AWS accounts for dev, staging, prod, and maybe more environments. AWS seems to be okay with this situation and is leaning into it, offering AWS Landing Zone and Organisations to help orchestrate multiple accounts. This works, but it seems like it adds a lot of complexity.
In contrast, GCP offers a much simpler starting point: Projects. Every resource you create (I think?) lives in a Project, and Projects live inside your (single) Organisation. By default, resources within a Project are accessible to other resources in that Project (as long as their API is enabled) and are inaccessible to anything outside of the Project. For example, if you create a Cloud SQL database and a Cloud Storage bucket, by default a VM inside the Project can access both resources, but a VM outside of the Project can’t. This goes a long way towards setting up safe permission structures. If you need to, you can set up inter-project resource access, but it’s not something you’re likely to need too much. In my experience, I needed it for accessing disk images and DNS records, but everything else for Deps’ production service lives isolated in the production project.
Google’s Organisation management offers hierarchical folders with cascading permissions. I haven’t needed to use them, but it seems like this would scale well to very large organisations. I wonder if this is modelled on the way Google’s internal permission system works?
GCP has quite a different product philosophy to AWS. When new GCP features and resources are released into general availability, they are usually very high quality. This is in contrast to AWS where it can sometimes feel like you are the first person to use a feature. A quote I have seen which rings true to me is “Google’s Beta is like AWS’s GA”.
GCP has also done well at integrating their different services. GCP provides a smaller set of core primitives that are global and work well for lots of use cases. Pub/Sub is probably the best example I have for this. In AWS you have SQS, SNS, Amazon MQ, Kinesis Data Streams, Kinesis Data Firehose, DynamoDB Streams, and maybe another queueing service by the time you read this post. GCP has Pub/Sub. Pub/Sub seems flexible enough to replace most (all?) of AWS’ various queues. (Disclaimer: I haven’t used Pub/Sub yet, just looked at its documentation.)
Google opts for strong consistency by default. Google Cloud Storage has consistent lists and gets, Cloud Key Management Service has strongly (globally) consistent key enablement, and Cloud Spanner is their globe-spanning consistent database. Many AWS services are eventually consistent, which pushes complexity onto the developer.
Google has a smaller set of primitives than AWS, and they are often simpler too. Rather than having dozens of compute instance types, they offer just four: micro, standard, highmem, and highcpu. In fact, these are all just pre-configurations in the design space, and you can size and configure your instances to be just about any combination of memory and CPU.
Another small thing, but one that I appreciated, is that GCP’s resource names are often aesthetically prettier. Either you get to name resources yourself (I’m looking at you jfkjfkfjlkjfjfkak.cloudfront.net), or the generated names are short, e.g.
Global and Regional by default
One of the nicest things about GCP is that most resources are either global or regional. This includes things like the control panel (you can see all of your project’s VMs on a single screen), disk images, storage buckets (multi-region within a continent), network configuration, global load balancing, Pub/Sub, VPC networks, and probably more I’m forgetting. This contrasts with AWS, where most resources, including the control panel, are either zonal or regional.
Google Compute Engine has been very solid for me. I’m only using it for a handful of VMs but I haven’t noticed any issues with it. The instance group managers work well and auto-scaling does what it’s meant to do. Health checks can be used to destroy unhealthy instances and create new ones without manual intervention (this has pros and cons, as you may lose debugging information). Rolling updates work well for doing deployments. The Rolling Update manager waits for new instances to be healthy before shutting down old instances. This has helped catch a few issues before they hit production. When the host underneath your instance needs maintenance, GCP live-migrates the instance to another host. It’s nice: it seems to happen once or twice a week, and I’ve never noticed any issues with it. On AWS, maintenance events will require a reboot.
Google’s pricing model is much simpler. As mentioned above, you can preconfigure your machines to just about any combination of memory and CPU that makes sense for your application. You can also pick your Intel CPU instance family if you need certain CPU features, or just want the slight performance boost available.
For standard VMs, Google offers Sustained Use Discounts. If you run a standard VM for more than 25% of a month, Compute Engine automatically discounts your bill. If you run an instance for a full month this works out to a 30% discount. This is a nice benefit if you don’t want to pre-purchase capacity, but still have a stable workload. They even do something neat with ‘inferred instances’ where they bin-pack partial instance usage. This means you don’t lose your discount if you replace instances through the month, and ends up giving you the maximum possible discount. I can’t do it justice here; check out the docs, it’s really cool.
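To make the arithmetic concrete, here is a rough sketch of how a tiered discount produces the 30% figure. The per-quartile rates in `TIER_RATES` are an assumption based on my recollection of the Compute Engine pricing docs for standard machine types at the time of writing, not something to rely on for current pricing. The function names are purely illustrative.

```python
# Assumed incremental rates: each successive quarter of the month is billed
# at a lower fraction of the base price (100%, 80%, 60%, 40%).
TIER_RATES = [1.0, 0.8, 0.6, 0.4]


def effective_rate(fraction_of_month: float) -> float:
    """Return the average fraction of the base price paid for this usage."""
    if not 0 < fraction_of_month <= 1:
        raise ValueError("fraction_of_month must be in (0, 1]")
    billed = 0.0
    for tier, rate in enumerate(TIER_RATES):
        tier_start = tier * 0.25
        # Portion of this quarter of the month that the instance ran for.
        used_in_tier = min(max(fraction_of_month - tier_start, 0.0), 0.25)
        billed += used_in_tier * rate
    return billed / fraction_of_month


def discount(fraction_of_month: float) -> float:
    """Return the overall discount relative to the undiscounted price."""
    return 1.0 - effective_rate(fraction_of_month)
```

Under these assumed rates, running for a full month bills 0.25 × (1.0 + 0.8 + 0.6 + 0.4) = 0.7 of the base price, which is where the 30% comes from, while running for a quarter of the month or less earns no discount.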
Google’s answer to AWS spot pricing is Preemptible Instances. Instead of bidding on instances and working out the maximum price you’re willing to pay, they offer a single, fixed 80% discount. One thing that is trickier with Preemptible Instances is that you only get a 30-second warning before your instance is preempted. On AWS you get two minutes. The tradeoff is that both preemptible and standard instance startup times are faster because Google can evict Preemptible Instances more quickly. On AWS, after submitting a spot request you may have to wait a short time before it is fulfilled.
Lastly, Google has an answer to Amazon’s Reserved Instances, and again they are much simpler. To use Committed Use Discounts, you purchase vCPUs and memory capacity separately. Your purchase is tied to a region but is otherwise convertible to just about any instance configuration (see the docs for a few small limitations). Committed Use Discounts are automatically applied to the instances you are running in your workload, and then Sustained Use Discounts are applied to any resource usage on top of that.
Picking Terraform has been a really big win for Deps. It enables us to quickly and safely spin up a complete staging or development environment to test a risky infrastructure change. It also ensures that changes to production are tracked, commented, and coherent as a whole. Terraform takes a little bit of learning, but the Google Cloud Provider docs are really good, and the team working on the provider is constantly updating it with new features and resources in GCP. If you want to be on the bleeding edge of every GCP feature then Terraform may not be a good fit for you, as they don’t generally support beta resources. However you could manage those beta resources via the console/CLI/Deployment Manager until Terraform support is added, then import them into your Terraform config.
The one place I think Terraform has room to improve is in running multiple environments from a single configuration, e.g. dev, staging, prod. Workspaces are good, but there is limited support for switching variables based on your workspace.
Networking and Firewalls
Network and firewall configuration was simple and easy. The default networking decisions seemed good to me. However, I’m not running enough instances to run into any problems here. Google’s networks are global by default and enable inter-region communication with no extra setup or costs. On AWS, you need to run a NAT instance or NAT gateway for inter-region VPC communication.
Google Cloud Slack
The Google Cloud Platform Slack has been invaluable. There are many GCP engineers in the channels for their products. The Googlers and customers are both very helpful. You can glean bits of inside gossip, survey invites, and early access to beta programs. It is also helpful for debugging whether a problem is specific to your infrastructure, or whether it is a wider problem affecting others. More on that later.
Google’s console looks much nicer than AWS’s. Because it is tied to my Google account, I don’t need to log in with my 2FA key every day, as I do with AWS. The Console is well designed and laid out in a logical manner. It helps in this case that Google doesn’t have the same number of services that AWS has. You can pin any number of services that you use to a sidebar menu. I imagine as their offering grows, they may need to redesign the sidebar further. The Console is global, not regional, so you can see resources across all regions in a single ‘pane of glass’ for a project (apart from acquired companies, more below). No more wondering “Did I leave an instance running in us-east-2?”.
Google Cloud Storage (GCS)
I almost forgot to put GCS in this article because it is so reliable that it fades into the background. Most of the time I don’t really think about it. Multi-region storage is really nice if you can wear the slight latency hit. It makes it easier to run in multiple regions, something that is in Deps’ future plans to improve availability. GCS supports the S3 API natively, so you don’t have to run an S3 proxy in front of it as you would with Azure.
Logging is fast and easy to query. I’ve had good experiences with this, although my needs have been fairly modest. All of the GCP services emit structured logs, which are easy to query against. One downside is that log exports to GCS aren’t signed, and it doesn’t seem like you can verify that logs haven’t been tampered with. If anyone knows of a way to verify the logs, I’d love to know about it.
Google has a public issue tracker at https://issuetracker.google.com. Google employees mostly respond to issues with a “thanks for your feedback, we’ll take it under advisement” but it’s good that it exists, and sometimes they will ask clarifying questions, which shows that they do care. They also have a UserVoice for feedback and many mailing lists. It would be nice if there was a unified view of all of the mailing list/forums like AWS has. It’s not always clear what mailing lists are available. Update: jjjjoe on Hacker News pointed me to https://cloud.google.com/support/docs/groups which aggregates all of the mailing lists available.
Security was one of the main reasons that I chose a major cloud provider over a more niche host. Security is baked into everything Google does, starting with securing your Google account. Google has a good whitepaper covering their encryption in transit. It is nice that cross-region traffic is encrypted, as that’s one less thing to have to set up that you would need to do on AWS. Most services offer encryption by default, and your only choice is if you use Google’s keys or your own. I was recently setting up some S3 buckets and was surprised to see they still offered the option to have unencrypted buckets.
Google’s Metadata service requires a specific HTTP header on every request before it will respond. This helps protect against Server-Side Request Forgery (SSRF), which is welcome.
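As a small illustration, here is roughly what a metadata request looks like from Python. The `Metadata-Flavor: Google` header and the `metadata.google.internal` hostname are the documented ones; the helper names and the two-second timeout are just choices made for this sketch. The reason the header helps is that a typical SSRF bug lets an attacker choose the URL a server fetches, but not the headers sent with it.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"


def metadata_request(path: str) -> urllib.request.Request:
    """Build (but do not send) a request for a metadata server path."""
    request = urllib.request.Request(METADATA_ROOT + path.lstrip("/"))
    # Without this header the metadata server refuses to answer.
    request.add_header("Metadata-Flavor", "Google")
    return request


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Send the request; this only resolves from inside a GCP VM."""
    with urllib.request.urlopen(metadata_request(path), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

On a VM, something like `fetch_metadata("instance/preempted")` would be one way to check whether a preemptible instance has received its preemption notice.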
These are services that I’ve had an OK experience with. I wouldn’t strongly recommend them, but they’re not bad either.
Container Builder is a tool for building containers and running Continuous Integration (CI). Because it was a container building tool first, it can be a bit tricky to understand how you can use it to run CI. It is missing a lot of features that come standard with other CI tools like Circle CI or Travis CI, e.g. build caching, GitHub status notifications, Slack test failure notifications, running a sidecar container for a database. None of these individually is too hard to roll yourself, but it would be nicer if they came built-in.
Container Builder has a shared, free pool of n1-standard-1 instances that are kept running and are available to start running builds straight away. They also offer high-CPU VMs that you can pay to use, but they only boot up when your build starts. I tried out the high-CPU instances to run faster builds, but once the instance boot time was included the total test time was a wash, so I stayed on the n1-standard-1s.
In an earlier draft Billing was in the Bad section, but GCP recently released reports for billing. These are more limited than what AWS offers, but get me the information I was after. They also offer billing integration with Data Studio if you want to drill in deeper. AWS’s built-in dashboard is still the winner here, with a lot more flexibility and pre-built reports.
There is no way to set billing alerts for estimated usage, only actual usage. So if you want to keep track of your spend throughout the month and find out early if something is going rogue, you end up setting up 25%, 50%, 75% budget alerts. When you receive them, you then check how far through the month you are. In the last few days they have released billing forecasts, so hopefully billing alerts will be able to be based on forecasted spend, not just actual spend.
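The manual check I do when one of those alerts arrives amounts to a linear extrapolation, which can be sketched like this. Nothing here is a GCP API; the function names are made up for illustration, and the sketch assumes spend accrues evenly through the month.

```python
import calendar
import datetime


def projected_month_end_spend(spend_so_far: float, today: datetime.date) -> float:
    """Linearly extrapolate the month's total from the spend so far."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_so_far * days_in_month / today.day


def is_over_pace(budget: float, alert_fraction: float, today: datetime.date) -> bool:
    """True if hitting `alert_fraction` of the budget today implies an overspend."""
    spend_so_far = budget * alert_fraction
    return projected_month_end_spend(spend_so_far, today) > budget
```

For example, a 50% alert arriving on the 10th of a 30-day month projects to 150% of the budget and is worth investigating, while the same alert on the 20th projects to 75% and is fine.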
While GCP services exhibit strong consistency, I can’t always say the same thing for the documentation. You can sometimes see two pages disagree, e.g. HTTP/2 server push is both supported and not supported:
The load balancer acts as an HTTP/2 to HTTP/1.1 translation layer, which means that the web servers always see and respond to HTTP/1.1 requests, but that requests from the browser can be HTTP/1.0, HTTP/1.1, or HTTP/2. HTTP/2 server push is not supported. - Setting up HTTP(S) Load Balancing
http2_server_push_canceled_invalid_response_code: The load balancer canceled the HTTP/2 server push because the backend returned an invalid response code. Can only happen when using http2 to the backend. Client will receive a RST_STREAM containing INTERNAL_ERROR. - HTTP(S) Load Balancing Logging and Monitoring
The documentation will sometimes make assertions or tell you not to exceed certain limits, without telling you why, or what will happen if you go over the limit. An example of this is the Spanner split size:
As a rule of thumb, the size of every set of related rows in a hierarchy of parent-child tables should be less than a few GiB.
GCP docs sometimes miss information on the interactions between components, e.g. preemptible instances, autoscaling, rolling updates, and the HTTP load balancer. Sometimes the docs will give you one sentence, and leave you to figure out all of the implications. AWS docs can be overly verbose, but they are usually quite good at documenting integration with other features and services.
Sometimes when browsing documentation for a service or API you will find that the old way is deprecated, but the new way that they recommend you use is still in beta or alpha (!).
I have reported bugs/clarifications against AWS docs and got prompt feedback and even requests for clarification from AWS team members. This has never happened for my comments submitted against Google’s documentation.
Observability is a little bit difficult with autoscaling and health checks. Some of the different kinds of health checks aren’t logged anywhere as far as I can tell. This situation seems to be improving though: autoscaling log explanations were added recently. There is also no easy way to tie autoscaling logs into Slack notifications. If you want to do this, you’ll need to create a log export to Pub/Sub which will trigger a Cloud Function (equivalent to an AWS Lambda).
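A minimal sketch of what such a function could look like is below. It assumes a Python runtime for the Cloud Function, that the function is triggered by the Pub/Sub topic the log export writes to, and that `SLACK_WEBHOOK_URL` is replaced with a real Slack incoming webhook (the value here is a placeholder). The message prefix and function names are invented for the example.

```python
import base64
import json
import urllib.request

# Hypothetical placeholder; a real Slack incoming-webhook URL goes here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"


def slack_payload(event: dict) -> dict:
    """Turn a Pub/Sub event carrying an exported log entry into a Slack message."""
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # Exported log entries carry either a text payload or a structured JSON one.
    body = entry.get("textPayload") or json.dumps(entry.get("jsonPayload", {}))
    return {"text": "Autoscaler log: " + body}


def notify_slack(event: dict, context=None) -> None:
    """Entry point for a Pub/Sub-triggered function: post the message to Slack."""
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(slack_payload(event)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=5).close()
```

Keeping the message formatting in `slack_payload` separate from the HTTP call makes it possible to check the formatting locally without deploying anything or talking to Slack.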
As far as I can tell, you can’t get notifications before Cloud SQL’s maintenance is run; it just runs. You can specify when in the week the maintenance window falls, but not whether maintenance will actually run in it.
I had a bit of trouble dealing with dependency conflicts between different Java client libraries. Each wanted a different version of a set of common dependencies. I’m sure there must be a good reason, but several libraries come with a dependency on com.google.guava/guava-jdk5 which you’ll want to exclude and instead use an up-to-date Guava version.
The API libraries and tools are spread across several GitHub organisations including GoogleCloudPlatform, Google, and possibly others, which can make it a little difficult sometimes to track down the definition of something.
Stackdriver Trace has been a bit of a pain to work with. The Java SDK has never been very well documented, and there are several different, minimally documented API versions available. Most recently it seems like the recommended approach is instead to use OpenCensus to instrument your code.
Stackdriver provides two agents to capture logs and metrics. Log capture is provided through a fluentd plugin and metrics via collectd. When I tried running Deps on g1-small instances, I found unexplained high CPU spikes from collectd and, on one occasion, missing logs. Since moving back to n1-standard-1 instances I haven’t seen any of these issues.
Cloud SQL proxy
Google Cloud SQL has many options to connect to a PostgreSQL instance, but not the one that you probably want: access through your VPC. Instead, you will probably end up running the Cloud SQL Proxy on your VM, which opens a tunnel to your SQL instance.
When using the Cloud Shell, it uses temporary IP whitelisting. However, I found that it took so long for the whitelisting to be applied (1-2 minutes) that I usually gave up and chose another method to connect.
The pace of improvements feels very slow compared to AWS. Announcements made last year have taken a long time to come out. For example, Customer Reliability Engineering and a new approach to support were announced in 2017, and their new support offering has only just become available. It just feels like a much smaller operation than AWS. There’s no time to waste, AWS is accelerating, and it doesn’t feel like GCP is keeping up.
Currently, GCP is missing parity with a lot of AWS services. While drafting this earlier in the year I noted that they were missing caching, WAF/DDoS protection and a low-latency key-value store (that doesn’t start at $700/month). However, since then, they have announced Cloud Memorystore for Redis and Cloud Armor for WAF/DDoS protection. I’m looking forward to seeing if anything else will be announced at Google Cloud Next in a few weeks.
I paid for Silver support (the lowest paid tier) for a few months while I was working through some issues. Support was often not very helpful on the first interaction and didn’t really seem to understand the problem. I needed to be quite persistent to communicate what the problem was. I’m not sure if paying for higher tiers would have helped here?
Getting Booted by The Algorithm
I haven’t experienced this myself, but I recently saw a harrowing tale from a GCP customer who had their entire project shut down by Google’s fraud protection system, with full deletion scheduled for 3 business days later. It wasn’t clear how much the customer was at fault here, but this response seemed disproportionate. This isn’t the only time I’ve seen this kind of thing happen from Google.
A Customer Engineer at Google Cloud commented that if you are worried about this scenario, you should set up invoiced billing with Google Cloud. I was worried about this, so I inquired about setting it up. The response I got back was:
Prior to applying for monthly invoicing, please review the following minimum requirements to determine if you are eligible to apply. These requirements include, but are not limited to:
- Being registered as a business for a minimum of one year.
- Spending a minimum of $2,500 a month for the last 3 months.
I’m not sure why they didn’t tell me the full list of requirements (“These requirements include, but are not limited to”), but I’m not spending $2.5k/month, so this wasn’t an option for me. Google does let you add a second credit card in case your primary one has a problem. I think I just have to keep my fingers crossed that I don’t run afoul of The Algorithm.
I’m now using EnvKey (which is excellent by the way), but it would be nice to have an easy way to store and retrieve sensitive secrets built into GCP. There is a recipe for setting it up yourself with a bucket and encrypting the data with KMS, but it’s a bit of a pain, and it would be much nicer to just have an API to call which both stores and encrypts the data, à la AWS Secrets Manager.
Google Cloud requires you to enable access to an API before you can use it. This can sometimes take a while. If you only use the Console to administer your resources, you’re unlikely to run into much trouble, as these APIs are enabled on first use. When using Terraform to rebuild testing environments, deleting and recreating these services turned out to be the longest pole in the tent. There is a new disable_on_destroy = "false" option for the google_project_services Terraform resource which lets you keep the services around when destroying all resources, which is helpful. Occasionally I notice that Google has renamed or split an API service into multiple pieces. When I run Terraform, this step will fail because it wants to re-add an API which no longer exists.
It would be easier if I didn’t need to think about this API access, but I assume that it is there for good reasons, probably involving capacity planning.
While Google has done well at integrating core infrastructure, acquisitions have not gone so well. There should be a single console for all GCP resources. Instead, you have the Console, Stackdriver, Firebase, a separate Support portal, BigQuery, and Zync, all living in different places.
I have only had two really bad experiences on Google Cloud Platform.
Stackdriver Monitoring has been my biggest disappointment with Google Cloud Platform. It apparently reuses Google’s internal infrastructure, though I’m not sure which parts, and I doubt that it includes the front-end. I’ve had a number of issues using Stackdriver Monitoring:
- The minimum time you can set before a failing Uptime check becomes unhealthy is five minutes. It then takes an additional five minutes for that error notification to be sent via email or PagerDuty integration. This means that it will take around ten minutes from the time a production service stops responding to the time you are notified. Good luck maintaining your SLOs with that kind of delay. Apparently this is working as designed.
- Uptime checks take around 25 minutes to actually start checking after you define them. You can preview an uptime check result before starting it, but it is still frustrating to have to wait so long for the uptime checks to start firing. The only other service I’ve ever used that behaved like this was CloudFront, which also has extremely long update/creation times.
- The Stackdriver Monitoring console is separate from the rest of the Google Cloud Console and requires you to re-auth to enter it. The layout is a bit confusing, and not that intuitive to navigate. I sometimes got UI glitches when configuring things that required a page refresh.
- As best as I can tell, there is no way to access history for an instance once it is destroyed. This doesn’t work well with an autoscaling environment if you want to find out why an instance was destroyed, or what was going on when it was unhealthy.
- Charts aren’t labelled with units, which means it’s often unclear what is being measured, e.g. JVM GC count is “0.013”. I think that’s per second, but I’m not really sure.
- The previous pricing model was not very good. If you wanted Slack or PagerDuty integration you needed to pay for Stackdriver Premium. This cost $8/resource/month. They have just moved to a new pricing model which is priced based on usage. This is a good move, as I was only paying the extra amount for Stackdriver Premium so I could get longer log retention and Slack notifications.
HTTP Load Balancer
The HTTP Load Balancer sounds like a magical experience. You get a single IP address and your customer traffic will enter Google’s network at the closest location to them. Then it will transfer over Google’s premium dark fibre until it is routed to the closest region you are running in. This is a shared resource among all of Google’s customers and Google itself and doesn’t need to be warmed up.
However, we had an extremely annoying issue where external traffic would get very occasional (0-3 daily) 502 errors returned by the load balancer without even contacting my instances. This is more of a problem for Deps than a standard web application, as if this 502 is served to a Maven client downloading dependencies, it will fail the entire download process.
There is a field on the load balancer logs which says which backend caused the 502 failure, but this field is always blank. Not helpful. I tried contacting support about this. At first, they said that it was a problem at my end with misconfigured keepalive settings. Then they said that the small number of 502s was within an acceptable range (even when I had very low traffic volumes). Luckily I found other people had the same issue on Slack and the GCE mailing list, so I knew I wasn’t going crazy.
This was probably my most negative experience with GCP. Eventually, the 502 errors went away, but it was frustrating that it was never acknowledged.
I can see a number of opportunities for Google Cloud to improve. Preemptible instances are a great model, but I wish they were better suited for serving web traffic. AWS lets you create a pool of spot and non-spot instances so that if your spot instances are outbid your service doesn’t drop off the internet.
I would love to see more of Google’s proprietary technology made available, particularly around operations and monitoring. Google has a strong brand in Site Reliability Engineering, but their tools are still weak to middling in this area. Another service that would be very handy is a hosted etcd/Zookeeper service for service discovery, consensus, leader election, and distributed cron jobs.
Google has a very bad reputation around algorithmic actions shutting down or locking out customer accounts. This seems like an unforced error. They would do well to reverse this policy, make a statement about it, and put in place something a little more humane.
Google has (fairly or unfairly) become notorious for shutting down services and APIs that are infrequently used. To the best of my knowledge AWS has never done this. Amazon SimpleDB is de-emphasised, but it’s still available for use if you want it. Making a public commitment to not shut down APIs and services would help Google with developer and corporate mindshare.
AWS has been very poor at contributing back to the open source projects that they use. I’ve been pleased to see Google open source a lot of their work lately, notably Kubernetes, as well as smaller projects like gVisor and Jib. More work contributing back to open source projects they run like Redis, Postgres, and MySQL would help them improve developer mindshare.
Google Cloud has created a compelling offering, with a mix of rock-solid infrastructure, plus unique value-added products like Spanner, Pub/Sub, and Global Load Balancing. They’ve been able to learn from what AWS got right and wrong. Their products integrate well together and are simple to understand. The downside to Google’s more deliberate approach is that it can sometimes feel like AWS is not just ahead of GCP, but accelerating away. I’m hopeful that the upcoming Google Cloud Next will bring more parity with AWS’ offerings.
For companies that don’t want to spend a lot of time learning and dealing with the complexities of AWS, I recommend looking at Google Cloud. If I had to start all over again, I would still happily choose Google Cloud.