Entries in cloud (7)

Sunday
Nov 20, 2011

Building an Application upon Riak - Part 1

For the past few months some of my colleagues and I have been developing an application with Riak as the primary persistent data store.  This has been a very interesting journey from the beginning to now.  I wanted to take a few minutes and write a quick "off the top of my head" post about some of the things we learned along the way.  In writing this I realized that our journey breaks down into a handful of categories:
  • Making the Decision
  • Learning
  • Operating
  • Scaling
  • Mistakes
We made the decision to use Riak around January of 2011 for our application.  We looked at HBase, Cassandra, Riak, MySQL, Postgres, MongoDB, Oracle, and a few others.  There were a lot of things we didn’t know about our application back then.  This is a very important point.

In any event, I’ll not bore you with all the details but we chose Riak.  We originally chose it because we felt it would be easy to manage as our data volume grew and because published benchmarks looked very promising.  We also wanted something based on the Dynamo model with adjustable CAP properties per “bucket”, and it fit our needs for speed, our “schema”, our data volume capacity plan, our data model, and a few other things.
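
To make the "adjustable CAP properties" point a little more concrete: in Dynamo-style stores the knobs are typically the number of replicas per key (N) and how many replicas must answer a read (R) or acknowledge a write (W). Below is a minimal Python sketch of the usual rule of thumb that reads see the latest acknowledged write when R + W > N. The function name and the two sample profiles are assumptions made up for illustration; they are not settings from the application described in this post.

```python
def quorums_overlap(n, r, w):
    """Return True when every read quorum must intersect every write quorum.

    n: replicas per key
    r: replicas that must answer a read
    w: replicas that must acknowledge a write
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# Hypothetical per-bucket profiles, for illustration only.
profiles = {
    "favor_consistency": (3, 2, 2),
    "favor_availability": (3, 1, 1),
}
for name, (n, r, w) in profiles.items():
    print(name, quorums_overlap(n, r, w))
```

Lower R and W values keep a bucket answering when replicas are unreachable, at the cost of possibly reading stale data; that is the trade-off being tuned per bucket.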

Some of the Stack Details

The primary programming language for our project is Scala.  There is no reasonable Scala client for Riak at the moment that is kept up to date, so we use the Java client.

We are running our application (a rather interesting business analytics platform if I do say so myself) on AWS using Ubuntu images.

We do all of our configuration management, cloud instance management, monitoring harnesses, maintenance, EC2 instance management, and much more with Opscode Chef.  But, that’s a whole other story.

We are currently running Riak 1.0.1 and will get to 1.0.2 soon.  We started on 0.12.0 I think it was... maybe 0.13.0.  I’ll have to go back and check.

On to some of the learning (and mistakes)

Up and Running - Getting started with Riak is very easy, very affordable, and covered well in the documentation.  Honestly, it couldn't be much easier.  But then... things get a bit more interesting.

REST ye not - Riak allows you to use a REST API over HTTP to interact with the data store.  This is really nice for getting started.  It’s really slow for actually building your applications.  This was one of the first easy buttons we de-commissioned.  We had to move to the protocol buffers interface for everything.  In hindsight this makes sense but we really did originally expect to get more out of the REST interface.  It was completely unusable in our case.

Balancing the Load - Riak doesn’t do much for you when it comes to load balancing your various types of requests.  We settled, courtesy of our crafty operations team, on running haproxy on each application node to shuttle requests to and from the various Riak nodes.  Let me warn you.  This has worked for us but there be demons here!  The configuration details of running haproxy in front of Riak are about as clear as mud and there isn’t much help to be found at the moment.  This was one of those moments when I really wished for the client to be a bit smarter.

Now, when nodes start dying, getting too busy, or whatever else might come up, you’ll be relying on your proxy (haproxy or otherwise) to handle this for you.  We don’t consider ourselves done at all on this point but we’ll get there.
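
For a sense of what a "smarter client" could look like, here is a minimal Python sketch of round-robin node selection with failover. It is not how haproxy or the Riak Java client works; the class, the function, and the node names (`riak1` and so on) are all hypothetical, and a real version would also need health checks to bring nodes back into rotation.

```python
import itertools


class NodeRotation:
    """Round-robin over nodes, skipping any that are marked down."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)
        self.down = set()
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.down.add(node)

    def mark_up(self, node):
        self.down.discard(node)

    def next_node(self):
        # Try each node at most once per call so we never loop forever.
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node not in self.down:
                return node
        raise RuntimeError("no healthy nodes available")


def call_with_failover(rotation, request):
    """Run request(node); on ConnectionError mark the node down and retry."""
    last_error = None
    for _ in range(len(rotation.nodes)):
        node = rotation.next_node()
        try:
            return request(node)
        except ConnectionError as err:
            rotation.mark_down(node)
            last_error = err
    raise RuntimeError("all nodes failed") from last_error


# Hypothetical usage: riak1 is unreachable, so the call lands on another node.
def fake_request(node):
    if node == "riak1":
        raise ConnectionError("connection refused")
    return "handled by " + node


rotation = NodeRotation(["riak1", "riak2", "riak3"])
print(call_with_failover(rotation, fake_request))
```

Putting this logic in a proxy instead, as described above, keeps it out of every application process, which is part of why the proxy route is attractive despite the configuration pain.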

Link Walking (err.. Ambling) - We modeled much of our early data relationships using link walking.  The learning?  S-L-O-W.  Had to remove it completely.  Play with it but don’t plan on using this in production out of the gate.  I think there is much potential here and we’ll perhaps be returning to this feature for some less latency sensitive work.  Time will tell...

Watchoo Lookin’ for?! Riak Search - When we started, search was a separate project.  But, we knew we would have a use for search in our application.  So, we did everything we could to plan ahead for that fact.  But, by the time we were really getting all hot and heavy (post 1.0.0 deployment) we were finding out a few very interesting things about search.  It's VERY slow when you have a large result set.  It's just the nature of the way it's implemented.  If you think your search will return more than 2000 items then think long and hard about using Riak's search functions for your primary search. This is, again, one of those things we’ve pulled back on quite a bit. But, the most important bits of learning were to:
  • Keep result sets small
  • Use inline fields (this helped us a lot)
  • Realize that searches run on ONE physical node and one vnode and WILL block (we didn’t really feel this until data really started growing from 100’s of 1000’s of “facets” to millions)
At this point, we are doing everything that we can to minimize the use of search in our application, and where we do use it we’re limiting the result sets in various ways and using inline fields pretty successfully.  In any event, just remember that Riak Search (stand alone or bundled post 1.0.0) is NOT a high performance search engine.  Again, this seems obvious now but we did design around it a bit and had higher hopes.
 
OMG It’s broken what’s wrong - The error codes in the early version of Riak we used were useless to us and because we did not start w/ an enterprise support contract it was difficult sometimes to get help.  Thankfully, this has improved a lot over time.

Mailing List / IRC dosey-do - Dust off your IRC client and sub to the mailing list.  They are great and the Basho Team takes responding there very seriously.  We got help countless times this way.  Thanks team Basho!

I/O - It’s not easy to run Riak on AWS.  It loves I/O.  To be fair, they say this loud and clear so that’s my problem.  We originally tried a fancy EBS setup to speed it up and make it persistent.  In the end we ditched all that and went ephemeral.  It was dramatically more stable for us overall.

Search Indexes (aka Pain) - Want to re-index?  Dump your data and reload.  Ouch.  Enough said.  We are working around this in a variety of ways but I have to believe this will change.

Basho Enterprise Support - Awesome.  These guys know their shit.  Once you become an enterprise customer they work very hard to help you.  For a real world production application you want Enterprise support via the licensing model.  Thanks again Basho!

The learning curve - It is a significant change for people to think in an eventually consistent distributed key value or distributed async application terms.  Having Riak under the hood means you NEED to think this way.  It requires a shifted mindset that, frankly, not a lot of people have today.  Build this fact into your dev cycle time or prepare to spend a lot of late nights.
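
One small example of what that shifted mindset means in practice: two clients can write the same key at about the same time, and the store may hand back both versions for the application to reconcile (Riak calls these siblings when a bucket is configured to allow multiple values). The sketch below, in Python, shows one common reconciliation strategy for set-like data, taking the union. The function name and the sample data are assumptions for illustration, not code from our application.

```python
def merge_siblings(siblings):
    """Merge conflicting versions of a set-valued record by taking the union.

    Union does not depend on the order of the siblings, so every reader that
    sees the same versions converges on the same merged value. It cannot
    express removals, which need a different strategy.
    """
    merged = set()
    for value in siblings:
        merged |= set(value)
    return merged


# Two clients added different tags to the same key at about the same time.
version_a = {"analytics", "reports"}
version_b = {"analytics", "billing"}
print(sorted(merge_siblings([version_a, version_b])))
```

The point is less the three lines of logic than the habit: every read path has to answer the question "what do I do if I get more than one answer?"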

Epiphany - One of the developers at work recently had an epiphany (or maybe we all had a group epiphany).  Riak is a distributed key value data store.  It is a VERY good one.  It’s not a search engine.  It’s not a relational database.  It’s not a graph database.  Etc.. etc..  Let me repeat.   Riak is an EXCELLENT distributed key value data store.  Use it as such.  Since we all had this revelation and adjusted things to take advantage of the fact life has been increasingly nice day by day.  Performance is up.  Throughput is up.  Things are scaling as expected.

In Summary - Reading back through this I felt it came off a bit negative.  That's not really fair though.  We're talking about nearly a year of learning.  I love Riak overall and I would definitely use it again.  It's not easy and you really need to make sure the context is correct (as with any database).  I think team Basho is just getting started but is off to a very strong start indeed.  I still believe Riak will really show its stripes as we start to scale the application.  We have an excellent foundation upon which to build and our application is currently humming along and growing nicely.

I could not have even come close to getting where we are right now with the app we are working on without a good team as well.  You need a good devops-like team to build complex distributed web applications.

Lastly and this is the real summary, Riak is a very good key value data store.  The rest it can do is neat but for now, I'd recommend using it as a KV datastore.

I'm pretty open to the fact that even with several months of intense development and near ready product under our belt we also are only scratching the surface.

What I'll talk about next is the stack, the choices we've made for developing a distributed Scala-based app, and how those choices have played out.

Sunday
Aug 07, 2011

Can New Clouds Teach Old Apps New Tricks?

Cramming the same old code, CMS, application, etc. into the cloud (any cloud) doesn't make the most of the capabilities of cloud computing in all its various forms.  I expect to be discussing this subject more in the near future.  But, I'll start by giving two examples and labeling them a cloud native application design pattern and anti-pattern.

A Cloud Native Application Design Anti-Pattern

I'll pick on Drupal a bit (but with love).  If one installs Drupal at a cloud IaaS or PaaS provider then that does not make Drupal a cloud native application.  To me, this seems obvious but I am not so sure it is obvious in general.  The Drupal CMS is not a Cloud Native Application.  Putting Drupal, WordPress, or the CMS XYZ of your choice on the cloud computing IaaS or even PaaS provider of your choice essentially means you end up with a virtualized n-tier application running in the cloud with many of the same limitations of a hardware based deployment and only some of the benefits of being a cloud native application running on a cloud computer.  Yes, of course, and admirably (see billions of pageviews per month), Drupal can run IN the cloud.  But, that does not make it OF the cloud.  But, I will say that, based on personal experience and even considering all of this, it's still likely the right choice in a great many cases to run it in the cloud.

A Cloud Native Application Design Pattern

If you want to see what a CMS can look like as a cloud native application then check out the Lily CMS project. I personally might not choose this specific architecture and systems design to achieve the same goals.  However, there is more than one way to build a CNA.  They have done some great work there and are clearly on the right track!  It's excellent work and I have respect for what the Outerthought team has created with their platform.  It's actually potentially quite a lot more than just a CMS as well.  In any event, I think that with the exception of the default HBase high availability limitations (which will be addressed soon by the HBase project I suspect) this can be considered a cloud native application.  Coupled with the appropriate monitoring, automation, and even cloud environment awareness it would be a very powerful cloud native application.

All of this summarizes to me as one very simple fact.  There is a tremendous opportunity ahead!  Exciting times.

Sunday
Aug 07, 2011

The NIST Definition of Cloud Computing (Draft)


I thought I'd start the week with a reminder of an oldie but goodie.  This document came out after the initial barrage of "what is cloud computing" and "cloud computing defined" posts from a few years ago.  But, I've always felt that NIST did a great job with it overall.

In my opinion it's still one of the best and most complete current definitions of Cloud Computing out there.  So, on the off chance that you have not seen this definition of cloud computing, it is definitely worth the time to read through.

One of my earliest definition articles from April 2008, Get Your Head in the Clouds, is still my most trafficked article on this site most weeks.

"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models..."  --NIST Cloud Definition

In particular I like the way they break down the characteristics, service models, and deployment models.

 

Tuesday
Aug 02, 2011

Welcome VGBuilder to the world!

Build custom apps + deploy them to your cloud host of choice -- straight from the command line.


The first applications using VGBuilder are live already and there are more in the works.

This Cloud Native Application Development toolset was created out of a genuine need for a big project about a year ago that needed to move very fast.  There were no tools that we felt really met our needs at the time so we created our own.  That’s a story that will be told soon in more detail I expect.

I consider VGBuilder a Cloud Native Application development tool set.  It does not meet all my requirements, as you’ll note if you read the above article.  But, in time I believe it will.

To understand some of my thinking that went into VGBuilder it helps to read these articles:

Cloud Native Applications

I wouldn't claim necessarily that VGBuilder is revolutionary but it certainly is evolutionary.  One of the more powerful aspects of this toolset is that it removes the need for any middle man or centralized platform to get your ideas from your brain and into the cloud crazy fast and with quite a bit of style.  It allows you to build and deploy your applications in the cloud with almost no barriers.  It removes the need for PaaS services for many people right from the start.

Essentially your laptop or computer becomes your private cloud and either Amazon or Rackspace (supported today out of the box) becomes your public cloud. This is essentially a hybrid cloud model that puts all the power in the developer's hands and automates almost everything that is often tedious or cumbersome otherwise.

Things are still rough around the edges.  As much as anything I want to gauge interest and get feedback on this new set of tools.  There is a long way to go but preliminary case studies have been excellent.

Please contact me or sign up on the early access web page if you’d like to be kept informed of the progress and given access to use the tools for your own projects as it is opened up to more early adopters.

 

Friday
Apr 22, 2011

On Clouds and SPOF’s (or the Great AWS Outage of April 2011)


Just a couple of days after posting about cloud native applications Amazon raised the bar by having some issues in one of their data center regions.  These issues primarily affected EBS and RDS from what I’ve read.  So, pretty much everything one way or another since using AWS EC2 without EBS in any form for most applications that exist today is a little wacky for most folks.  This is because your EC2 AMI won’t persist through a reboot in the absence of the use of EBS.  Most folks have not yet reached the operational nirvana of fully automated configuration management and application fault tolerance that makes this acceptable for them.

What level of SPOF (Single Point of Failure) are you willing to tolerate?  I wanted to “scale up” the idea of the SPOF then bring it back down again.  Here we go.

If the earth stops working, so will your web application (admittedly there might be some satellite networks that don’t have this problem... but who cares at that point?)

So, let’s keep going.  Each of these is a potential single point of failure.

Earth > Continent > Country > State/Region > City > Neighborhood > Building > Floor > Room > Rack > Server > Server Component

And, at each tier, there are numerous dependencies and contexts to keep your service running at any given time.  There are the obvious ones like the above example where if the earth explodes the neighborhood is pretty much shot to hell also.  But, that’s obvious.  It gets less obvious when you dig deeper into the data center and see that there are 5 servers so that’s okay right?  Maybe. Maybe not. If it is something like:

Domain Name Service > Load Balancer > Web Server > Application Server > Database Server

Then those 5 servers/services, which might all be in that one rack per data center per room per building per neighborhood per city per state per country per continent per planet, are looking pretty vulnerable.  In the grand scheme of things the loss of one power supply in one machine could impact the entire planet’s capacity to retrieve whatever is on that DB that is so globally important; like a picture of your kid making a funny face on his 2nd birthday.
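
To put rough numbers on a chain like the one above: when every tier must be up for a request to succeed, the availabilities multiply, so five tiers that are each up 99.9% of the time give you only about 99.5% end to end. The short Python sketch below works that out and compares it with doubling up each tier. The 99.9% figure is an assumption picked for illustration, and the redundancy formula assumes replicas fail independently, which is exactly what "all in one rack" violates.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of one tier with `copies` independent replicas,
    where the tier is up as long as at least one replica is up."""
    return 1.0 - (1.0 - availability) ** copies


# Assumed figure: each of the five tiers is up 99.9% of the time.
tiers = [0.999] * 5
single = serial_availability(tiers)
doubled = serial_availability([redundant_availability(a, 2) for a in tiers])
print(f"one of each tier: {single:.4%}")
print(f"two of each tier: {doubled:.4%}")
```

The same multiplication applies all the way up the list from server component to continent, which is why the granularity at which you add redundancy matters so much.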

Do you think it is Amazon AWS’s fault if you put that database on one server in one rack in one place with no reasonable SLA and it goes away forever?  Not so much.  You are accountable and responsible.  You made that choice.

Now, how can we change this for the better?  We can develop applications that are able to tolerate the loss of a single point of failure at a sufficient granularity (Earth is a bit extreme today) such that our applications keep running when bad things like the AWS outage occur.  I call these Cloud Native Applications.  They have certain traits that should look a little familiar to cloud folks.

You cannot create a cloud native application doing things the same way you always have before.  It simply will not work.  The necessary software architecture and systems architecture have changed if you want your application to run on the cloud w/ no SPOFs.

Just needed to get that off my chest.  Some related links for good reading:

http://blog.basho.com/2011/04/21/Amazons-outage-proves-riaks-vision/

http://www.thestoragearchitect.com/2011/04/22/so-your-aws-based-application-is-down-dont-blame-amazon/

http://highscalability.com/blog/2011/4/22/stuff-the-internet-says-on-scalability-for-april-22-2011.html

http://www.infoq.com/news/2011/04/amazon-ec2-outage

And if you're REALLY keen to write some CNA’s (contact me) and read...

http://www.infoq.com/presentations/Actor-Thinking
http://www.infoq.com/presentations/1000-Year-old-Design-Patterns

 

Sunday
Aug 22, 2010

Some Cloud Thoughts on a Clear and Sunny Day 

Cloud Computing is a deployment model and cloud computing is a business model.  Cloud computing is not some silver bullet magical thing.  It's not even easy *gasp* sometimes.

As a deployment model, cloud computing is simply summed up as on-demand, self-service, reliable services with low to no capital costs for the consumer.

As a business model it is summed up as, again, low to no long term capital costs (and the associated depreciation) and pay as you go service provider pricing models.  In reality these are mountains of micro transactions aggregated into monthly and yearly billing cycles.  For example, I spent $0.015 for a small compute instance w/ a cloud infrastructure provider because I just needed an hour of an Ubuntu 10.04 Linux machine to test a quick software install combination and update a piece of documentation.  I'll get a bill for that at the end of the month.  Get this...

An hour of compute time costs me 3.3 times LESS than a piece of hubba bubba chewing gum cost me at $0.05 (one time use only) over 30 years ago. #cloud

Enterprises and service providers are learning very quickly from the early public cloud vendors how to do things differently and often more efficiently.  It was well summed up in the Federal CTO's announcement of the government application cloud.  Basically: we saw that consumers could get IT services for orders of magnitude less than we could.  So, we're fixing that by emulating what the companies that service the consumers are doing. Smart.  Bechtel did this exact same thing years ago when it found that Amazon's cost per GB of storage was orders of magnitude less than its own, asked the very important question of why, and then answered it very well.
A couple of years ago now I helped found a company called nScaled.  nScaled does business continuity as a service.  It is only possible with the resources, at the price, and at the speed we have moved because we followed cloud computing deployment and business models.  It would not have been possible for us to build this business when we did and the way we have without these models.
In March 2008 I called cloud computing a renaissance.

It is my opinion that Cloud Computing is a technology architecture evolution that, when properly applied to business problems, can enable a business revolution. I've been saying this for a while but in recent weeks I have actually come to prefer the term renaissance over revolution.

Today, two years into a startup that uses the raw power of cloud computing deployment and business models across the board to enable new ways for companies to consume disaster recovery and business continuity solutions I can say without a doubt that I believe that cloud computing is a renaissance more than ever before!