Operator Review of new features - Blueprints on Blueprints (aka BOB) - aim: apply a consistent operational view to all the things
Specific information put into blueprints
Blueprint alerts
Operators in design summit sessions
BluePrint Review Process (gerrit?)
Stability of core components (reconnecting rabbitmq, transactional integrity, database issues, performance issues)
compute resiliency - reconnect, retry, recover for compute nodes
chaos monkey blueprint aka StackMonkey (TM)
HA model for every service
Keystone token thing
write out list of test stories
create a process for operators to submit test cases
Rolling upgrades and N-1 compatibility
downgrades (more thought at design level is warranted, but see also http://docs.openstack.org/trunk/openstack-ops/content/ops_upgrades-roll-back.html)
no-downtime DB migrations
Communicate to/educate operators that they need to do these things: propose backports, learn how to contribute, and know that you can dump raw notes into doc bugs
specify the failure use cases we want to be addressed in standard testing
HA mode for every service (need to get operations criteria into dev discussion to fix permanently, rather than just these below)
cinder-scheduler
cinder-volume
neutron - no layer 3 networking HA
Make operators feel more welcome
at the summits
by giving them ATC status
Fixing bad defaults
health_check_frequency in nova - kills neutron (default was 60 seconds)
dhcp_timeout - 120 seconds, should be 24 hours
periodic_task_frequencies
Get a group of people together to work on packaging toolchain
Get a group of people together to work on operations tools/extension of the CLI to make it useful for ops
Tool to get VMs out of a hypervisor for hardware maintenance
Tool to delete things properly
Tool to replace a 5-line shell script that can "fix quotas"
Command-line utilities syntax uniformity
Ensure all error messages are actionable
Availability zones are a nova only concept
IPv6 across all projects
Distributed glance


####################
# Install/Config, initial planning & capacity planning #
####################

####################
# Upgrades
####################
- TripleO - seems massively important for OpenStack adoption.
Is it? Is upgrades/install TripleO's responsibility? ... According to https://wiki.openstack.org/wiki/TripleO, it is.  Seems like this question is similar to asking whether Nova is responsible for compute.
"TripleO is a solution looking for a problem"? ... How so?  Everyone needs to Install and Upgrade.  Is there an issue with how TripleO is designed?
TripleO = bare metal provisioning ... Should the architecture evolve so that it handles install / upgrades in a more general sense?
Image based deployment vs Image Plus config management
TripleO assumes your entire model is sussed out through OpenStack. It doesn't take into account the millions of dollars of IT infrastructure you've already got.  Does this mean the architecture needs to change?
"Looked at tripleO and saw it was focused on deployment. It didn't have an operational model. Community ignored everyone in the room." ... Lack of leadership or consensus around TripleO?
Then, there was tuskar.
Is bare metal provisioning something operators want OpenStack to solve, or do we prefer to use existing solutions (presumably we deployed things before openstack and have in-house methods)?

When talking about TripleO do you imply use of Ironic? Yes. 
... but it doesn't work ... because of hardware discovery  ... Does this mean it is DOA? Or does it mean the architecture is immature?   OpenStack doesn't have many top level components. Does TripleO need to be re-thought?

(?) Core issue: ironic and tripleO are detached from a realistic environment ... Seems bad.  Does the Tech Lead of TripleO know this?

"I have an Essex environment that I can't upgrade" -- nobody can :-( I did it :) It worked :). me too (it's hard) (very hard).

* major changes address features, which is a reason to upgrade

Who is responsible for architecting the new feature sets so upgrades are doable? eg nova-network to neutron
* one thought: feature definitions should ensure migration is possible, and rollback is possible
--> because the cost of changes is so high

some database migrations are bad in this sense right now - because the gate allows rollback to be a noop
   * is that true?  I thought that's what grenade was supposed to address.. it used to be true.. grenade may have stopped this behavior

Q: how would common upgrade and rollback across all projects be managed?
- sounds like a problem for source control


##############
# Deployment   #
##############
- Service idempotency/reload - should be able to recover or kill off an in-flight build/operation that's interrupted, versus leaving it building indefinitely

Packaging
a lot of people are doing their own packaging

"packages in RHEL and ubuntu don't work" - other than getting the first installation going, we end up building packages to make patches/fixes.
 solution: wait
 solution: take part in test days. subscribe to the mailing list.
 solution: submit bugs. works sometimes

suggestion: kill devstack and make it use packages -- question becomes which distro's packages? (deb, rpm seems logical)  +1  Debian, Ubuntu, CentOS, Red Hat?  (that's deb and rpm) [[ Developers like devstack -- think of these folks as often not too systems-savvy and not desirous of chasing each package; devstack helps them get up and running in a development environment (if everything works) in less than an hour ]]

who is interested in getting together to work on a packaging toolchain for openstack (finding one and gluing bits until it works)
- craig tracey
- john dewey
- it already exists guys... (link :D), https://github.com/stackforge/anvil, http://anvil.readthedocs.org/en/latest/topics/examples.html#preparing (and see building)
  - anvil downloads openstack projects from git, scans & analyzes for python dependencies, downloads & builds all python dependencies (skipping ones already in, say, EPEL) -> rpms, builds all openstack projects -> rpms, creates 2 yum repos (one for deps, one for core, including all srpms and binary rpms)... Just need to make sure it works on other platforms if we go with this.  The hard work is already done. (does not currently build deb pkgs, but it's abstracted so this can easily be done, see the hierarchy @ https://github.com/stackforge/anvil/tree/master/anvil/packaging ) ++
  - #openstack-anvil channel 
  - has been working for yahoo (godaddy and a few others) for 1+ years... -- works on centos, rhel, (likely fedora)
- omnibus (https://github.com/craigtracey/omnibus-openstack / https://github.com/craigtracey/omnibus-openstack-build )

development cycle moves beyond what is "supported"

why do people package?
* need isolation of virtual environments - different versions of openstack services on the one machine
* incorporate features from a newer version
* bug fixes in house until it gets into upstream distros
* vendor fixed drivers that are not yet part of upstream
* consistency = being able to make sure that what you deploy is the same every time; you want to know what you've deployed on the clusters. Sometimes upstream changes package versions, which can break things
* upstream dependencies break things - file bugs in PBR, because it was trying to pull things out of the git repo
* can't assume internet access from packages
* we want to release twice a week (continuous deployment)

Are you using distro packages and repackaging, or source? source. and some distro.

do you package other software (eg apache, mysql). most: no.
What about the kernel or say ovs?
the volatility of the kernel or ovs is much lower, so it's not needed. Also, openstack is a distributed system that is run as a service. You have to verify that one component at one version is compatible with another component at a different version. Also, it has a queue and a database whose data need to be migrated and maintained.

Is OpenStack going to get to the stage where you can just use the upstream package? 
* "no, OpenStack is a service, not software"
* the level of innovation and change in OpenStack is too hard for the distros to keep up ... as it starts to slow down, maybe
* the reason you don't have to package apache is because it doesn't change every month
* openstack is a distributed system which makes the dependency chain complicated - perhaps stronger abstraction between components could decouple this?


is there enough interest in an ecosystem/community on users of openstack to support stuff like packaging?

Customized deploy guides:
    - Have a form that the deployer fills out to pick various deployment choices: e.g. nova-network vs. neutron, HA choices.


##################
# Operations Issues (CD etc), ongoing operations, problem determination, monitoring, breakfix, tools, DBs (crosses other areas, but warrants some focus)  #
##################

- Common tools for fixing issues? (or should we just fix the underlying bug)
--should track it here: https://wiki.openstack.org/wiki/OperationsTools

Tool being made @ https://wiki.openstack.org/wiki/Entropy
- #openstack-entropy channel

Live migration
Issues i have with migration:
- Migration between different CPU types is currently not supported: https://bugs.launchpad.net/nova/+bug/1082414
- Needs a proper "host-servers-migrate" command (where you can also specify block migration). 
  Current one seems to try to copy the instances with ssh / using resize?? (haven't really had time to deeply dive into this yet)

- Related to block migration:
Nova block migrate will fail if image or snapshot the instance used to boot is no longer available => https://bugs.launchpad.net/nova/+bug/1281087

P.S.
Seems that there is a major overhaul going on by moving migrations to the conductor, not sure what is currently included. https://blueprints.launchpad.net/nova/+spec/live-migration-to-conductor


Nova logging
- Currently the compute logs sometimes mention that there are more machines on the hypervisor than there should be ("Found 29 in the database and 31 on the hypervisor"). It would be nice to have a tool to get an overview of these machines and clean them up.
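A rough sketch of the kind of tool this implies (illustrative only, not an existing utility): list the instance UUIDs libvirt sees on a hypervisor and diff them against what nova believes is running there (e.g. UUIDs taken from an admin-scoped nova listing for that host, left abstract here).

    # sketch: compare libvirt's view of a hypervisor with nova's
    import libvirt

    def hypervisor_uuids(uri="qemu:///system"):
        conn = libvirt.open(uri)
        try:
            # flags=0 returns both running and shut-off domains
            return {dom.UUIDString() for dom in conn.listAllDomains()}
        finally:
            conn.close()

    def diff_against_nova(nova_uuids, uri="qemu:///system"):
        on_hypervisor = hypervisor_uuids(uri)
        known_to_nova = set(nova_uuids)
        return {
            "on_hypervisor_only": on_hypervisor - known_to_nova,  # candidates to clean up
            "in_database_only": known_to_nova - on_hypervisor,    # nova thinks these exist
        }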

HA
- Ideally, all OpenStack services should fully support HA modes (ie. active/active behind a load balancer)
- If services don't support running in HA (cinder-volume, cinder-scheduler, etc), it should be explicitly stated.

Command-line utilities syntax uniformity
- The command-line utilities between the projects are very inconsistent in the usage of arguments. 
E.g.1 nova/neutron/glance/cinder use "show" to get details of an object while keystone uses "get"
E.g.2 Within one tool: nova show / nova aggregate-details
E.g.3 cinder uses display-name other uses name
E.g.4 nova list vs glance image-list
It would be good to have all utilities use the same syntax for similar functionality
Note:  There's a project to create a unified CLI, but we need PTLs to actively push for it
Need to ensure the approach taken will actually fix the consistency issues too.

 can't have 2 ports on one vip (https://blueprints.launchpad.net/neutron/+spec/lbaas-multiple-vips-per-pool)

An attempt was made to capture problems here: https://wiki.openstack.org/wiki/OperationsUseCases, perhaps these should be distilled into that wiki, then parsed out into bugs?

Most people have a 5-line shell script that can "fix quotas"
 - I think at least a good chunk of this is fixed in later versions, indicating a need for closed-loop communication about known issues, i.e. "a fix for this is in upcoming version X"
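For reference, a minimal sketch of what these scripts tend to do, assuming a Havana-era nova schema (quota_usages / instances tables) and a hypothetical DB URL -- verify table and column names against your own deployment before running anything like this:

    import sqlalchemy as sa

    # recompute the per-project 'instances' usage straight in the nova DB
    engine = sa.create_engine("mysql://nova:secret@localhost/nova")  # hypothetical DSN
    with engine.begin() as conn:
        conn.execute(sa.text("""
            UPDATE quota_usages
               SET in_use = (SELECT COUNT(*) FROM instances
                              WHERE instances.project_id = quota_usages.project_id
                                AND instances.deleted = 0)
             WHERE resource = 'instances'
               AND deleted = 0
        """))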
Collect tools and consolidate into a "Troubleshooting Toolbox"

Logfile collection and correlation

Quotas
* grizzly to havana upgrades kills quotas

Keystone changed the way it stores stuff. It dumps the entire JSON database, and depending on the size of the database (# endpoints), things break. <-- This bug was fixed recently ... the "fix" was to increase the size of the headers

had to change neutron because keystone gets smashed

can't run multiple workers on keystone

asks for a new token every time.

even though the token is valid for 24 hours, applications don't reuse the tokens
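A minimal sketch of the client-side fix, assuming the era's python-keystoneclient v2.0 API (hostnames and credentials are placeholders): authenticate once, cache the token, and reuse it until it nears expiry instead of hitting keystone on every call.

    from keystoneclient.v2_0 import client as ks_client

    # authenticate once and cache the token
    ks = ks_client.Client(username="ops", password="secret",
                          tenant_name="admin",
                          auth_url="http://keystone.example.com:5000/v2.0")
    cached_token = ks.auth_token

    # later calls reuse the cached token instead of re-authenticating
    ks_again = ks_client.Client(token=cached_token,
                                endpoint="http://keystone.example.com:35357/v2.0")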

database
performance problems
* migrations: upgrades that rename a column can lock a table for hours if there are lots of rows (see the expand/contract sketch below)
* volume of data returned in query can fluctuate with various changes
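A hedged sketch of the alternative (the "expand/contract" pattern): instead of an in-place rename that rewrites the whole table and breaks old code, add the new column, backfill, and only drop the old column in a later migration once every service reads the new name. Written Alembic-style purely for illustration (nova used sqlalchemy-migrate at the time); table and column names are hypothetical.

    from alembic import op
    import sqlalchemy as sa

    def upgrade():
        # expand: the new column lives alongside the old one, so old and new
        # code can run against the same schema during a rolling upgrade
        op.add_column('instances', sa.Column('new_name', sa.String(255),
                                             nullable=True))
        # backfill; on very large tables this would be batched by an
        # out-of-band script rather than done inside the migration
        op.execute("UPDATE instances SET new_name = old_name "
                   "WHERE new_name IS NULL")

    # contract: in a *later* migration, once nothing reads the old column:
    #     op.drop_column('instances', 'old_name')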

Backward compatibility of releases: separate the DB upgrade from the software release. Do the DB upgrade first, ensure old releases remain backward compatible with the new schema, and then do the software release.


neutron database uses

conductor
becomes a single bottleneck for database access - but you can run an arbitrary number of conductors; it scales very well horizontally
Increasing # of conductor workers in the config on each host running conductor also helps

SQL queries are not efficient - large query and filter in the code
eg list instances with some filters - queries nova for all instances
eg glance for images
eg LEFT JOIN INSTANCE_metadata - once per minute <-- bug, fixed later

potential problem: sqlalchemy might not make it easy to do the right thing

While exploring DB clustering, we found issues with some options because NDB wants no row to be more than 65k of data, but that's not guaranteed with some tables - instances, for example.

ceilometer
requiring mongodb is either a high or low bar
some people don't want to have to run mongo, or depend on it for your billing information

it's a problem to add extra components. eg if you add couch etc., the ops team has to learn it and work out how to back it up, etc.


Messaging system
"Most problems I have with OpenStack always boil down to clustered Rabbit"

https://bugs.launchpad.net/nova/+bug/856764
https://bugs.launchpad.net/nova/+bug/1188643

might be fixed in 3.2 to some extent

clients don't reconnect properly

running connections to rabbit through firewalls or load balancers can cause problems
eg compute controller to compute nodes, different security zones

should discover the connection was dropped and reconnect automatically; also try to kill the previous connection
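A hedged sketch of that reconnect behavior using kombu (the AMQP library underneath the RPC layer at the time); the broker URL is a placeholder and this is not the actual nova/oslo code. Heartbeats let the client notice a connection that a firewall or load balancer silently dropped, and ensure_connection retries with backoff instead of leaving the service wedged.

    from kombu import Connection

    conn = Connection("amqp://guest:guest@rabbit-vip:5672//", heartbeat=10)

    def on_error(exc, interval):
        print("broker unreachable (%s), retrying in %ss" % (exc, interval))

    # retry forever, with increasing intervals between attempts
    conn.ensure_connection(errback=on_error, max_retries=None)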

notifications queue keeps building up since nothing consumes it

default for notifications should be no-op


performance problems
user creation

vm creation

keystone getting hammered when performing tasks

neutron chokes when spinning up more than 10 vms. <-- could have been addressed. was due to SQL timeouts

neutron is not cell aware

heat
havana ssl proxy - 

keystone
ssl termination doesn't work
heat and ceilometer don't use the client correctly, especially if you are using custom certificates from an internal CA
glance doesn't like images being uploaded over ssl (are you referring to performance of uploads? I can upload to an SSL'd glance endpoint fine, but have performance problems with python-glanceclient. Using curl is fast, glanceclient -- not so much)

action: fix testing so things work. find out if testing uses real certificates. test with offloads.
tests for internal CAs
tests gated on for SSL and testing of SSL offloading
allow providers to provide client certificates, so that nova (and all services) can use a client certificate to connect between SSL endpoints (https://bugs.launchpad.net/nova/+bug/1279381)
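At the HTTP layer, what is being asked for looks roughly like this (a sketch with python-requests, which the clients wrap; endpoint and file paths are hypothetical): verify the server against the internal CA bundle and present a client certificate when connecting to another service's SSL endpoint.

    import requests

    resp = requests.get(
        "https://nova-api.example.com:8774/v2/",             # hypothetical endpoint
        verify="/etc/ssl/certs/internal-ca.pem",             # internal CA bundle
        cert=("/etc/nova/ssl/client.crt", "/etc/nova/ssl/client.key"),
    )
    resp.raise_for_status()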

config options in OpenStack
* more choice is a good thing? mixed response
* good defaults is key
* dynamic calculation where possible
* deprecation standard for configuration options
* every configuration option is a choice that a developer must make. Every choice a developer makes potentially diverges that deployment from the rest of OpenStack deployments.
perhaps a tool where you input what kind of cloud you're building and it works out the options

sometimes it involves reading the python code to work out what a configuration option means


health_check_frequency in nova - kills neutron (default was 60 seconds)
dhcp_timeout - 120 seconds, should be 24 hours
periodic_task_frequencies

default: open vswitch - should be linux bridge


don't store credentials in the configuration files

scheduler only cares about RAM by default; there are other resources, such as IP addresses, that should be factored in

networking
openvswitch uses 9 virtual interfaces to work
need 6 switches for a public ip address

just want something basic and simple

Perhaps many people currently using OpenVswitch in production could be using linux-bridge and simplifying their deployment significantly

when you do ovs show it's just a mess

it makes troubleshooting incredibly hard

ubuntu packages are heavily hardcoded for Open vSwitch <<-- weren't "opinionated defaults" asked for earlier? +1, can't have opinionated defaults *and* match all use cases. Original input related to a change of opinion from OVS as default to Linux Bridge as default, not to being less opinionated

HA mode
these services can't run in an HA mode:
cinder-scheduler
cinder-volume

neutron - no layer 3 networking HA; implementing VRRP among multiple L3 agents seems an obvious thing to do here...

nova-scheduler - if you have large numbers of VMs submitted at once, all schedulers in the load balanced cluster all try to choose the same hypervisor. The response was to use the RetryOnError option.

If we don't have design standards, acceptance criteria, or a reference architecture ... how do we expect to develop services that can run in HA? Or how can we expect different results than what we see today?

some people aren't too happy with pacemaker

nova-compute daemon is talking to only a single hypervisor. this is an issue for dealing with the number of nodes that need to be upgraded/talk to rabbit/talk to the database. Perhaps introducing hierarchy is a solution ...

trivial hardware interventions become annoying, eg live-migrating all VMs from one machine to another
need more automation

deletion of things
hypervisor failures - not all deletions work: tenant removals, project removals, massive hypervisor corruption/removal, removing an image from glance (it breaks right now if VMs use this as a backing image), VMs alive but OpenStack thinks they are dead
permanent removal of hypervisor from the services catalog


state integrity is missing <-- could be solved by taskflow and tuskar
^ eventually you will fall off the state-machine rails. really just need a tool to recover and get back on the rails. e.g. nova reset-state --force works 90% of the time


Availability Zones, Regions
doesn't align well with "the industry"
lack of consistency in terms of definition within OpenStack
AZ is a "user facing separation of your hypervisors"
Region equals availability zone 
cells are good for geographic separation

problem is: AZ is a nova only concept


Assumption is that there will be only one keystone.
The problem is that every attempt to provide some kind of segregated keystone ends up in the full-blown "federation" debate.
LDAP as the solution? (ldap implies a whole other infrastructure. Shouldn't Keystone be able to stand on its own? Why? You're all up in my parens, yo. )
Some people actually prefer keystone to ldap ;)


swift
it just works
it's a simple model
there's no message queue
there's no database
how did you instill this mentality into swift? the people

classify features based on their maturity


##################
# User-Facing Issues#
##################

* Dashboard: make the default tenant a user sees when he/she logs in be the default tenant that is set in keystone (when a user has rights to multiple tenants)
- Handle broken regions gracefully (now the dashboard breaks when one region is not working)
- Glance member-create (image sharing) functionality in the dashboard

nova API v2 vs v3
* Don't break userspace - it causes badness
* when you break user space, developers start looking at other options and if they leave, it's unlikely they're coming back
* (debate) we should support v2 "forever" "a long time measured in years"
* ops: you should be able to upgrade your system without the user-space breaking

* how does it work with the endpoint catalogue

API
keystone API varies greatly between good functionality in V3 and stuff that we need from V2


action: add a statement around API versioning and version deprecation (forever, or for at least N releases)

project vs tenant: no consensus

errors
the message that users get back is not actionable

keystone is particularly bad: a raised exception gets thrown away, so the user never gets the message containing the actual error

eg when a hypervisor can't be found for a nova boot request, the message that you get back is "3 attempts to schedule this, couldn't find a hypervisor", then in the nova logs you get the real error: "couldn't allocate port"

action: when there is a user-facing error message, the error message needs to be reasonable


catalog all the project API ERROR messages. Could this be automated like the configs?
Instead of returning a stack trace to the user, return a descriptive error message + the req-UUID which will identify all log lines?
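For illustration only (not an existing API contract), the kind of response body this suggests -- a human-actionable message plus the request id that ties it to the log lines, instead of a stack trace:

    error_response = {
        "error": {
            "code": 500,
            "message": ("No valid host found: every candidate hypervisor "
                        "failed port allocation on network 'prod-net'; "
                        "quote the request_id below when searching the logs"),
            "request_id": "req-<uuid>",   # hypothetical placeholder value
        }
    }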

Consistency in the CLIs from project to project
    - kill prettytable
    * this is being addressed in the unified SDK project https://launchpad.net/unifiedsdk https://wiki.openstack.org/wiki/SDK-Development/PythonOpenStackSDK
    * a proper CLI will then be built on that unified SDK
    * this is a huge undertaking and will take a long time

IPv6 across all projects

server status - single API endpoint per service to verify it's OK
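A minimal sketch of what such an endpoint could look like, as a tiny standalone WSGI app (illustrative only, not an existing OpenStack interface); in practice it would also check the service's own dependencies (DB, message queue) before answering OK.

    def healthcheck_app(environ, start_response):
        # answer GET /healthcheck with 200 when the service considers itself OK
        if environ.get("PATH_INFO") == "/healthcheck":
            body = b"OK"
            start_response("200 OK", [("Content-Type", "text/plain"),
                                      ("Content-Length", str(len(body)))])
            return [body]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]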



##################
# Feature Requests  #
##################

HA instances: flag an instance as highly available; such instances should be booted somewhere else when a host dies.
Flavor quotas for nova: integer quotas on the number of instances of flavor "X"
Cinder quotas need to be per-flavor. You may have a flavor that is SSD and another that is HD.
Keystone docs. Seriously, there are only a few out-of-date developer docs.
CLI: all python-*client tools need to have a raw-mode and a lot of TLC.

Feature to provide information about which API calls and versions users are using


tempest assumes that it's hitting devstack, but I want to use it for integration testing (or even against my production cloud)
and as a non-admin user

disabling resources by name

glance: ability to say an image is deprecated

make cinder so project/tenant and owner can be the same

distributed glance

image resynch

aws IAM (Identity and Access Management)

#############
# Community  #
#############

Something that currently is not within Openstack but I would like to see is a "vm marketplace".
A place where all kinds of images are published (ubuntu/fedora/SL6 etc., similar to e.g. www.vmware.com/appliances/). We would also like such a service internally.
You can currently set images to public but before you know it you have a list of 100 different images and people do not know what to choose.
Now we use image-sharing but it comes with headaches.
A marketplace where you can easily import from / add to your tenant would be great.

Blueprints don't have enough information
Blueprints don't have a good mechanism for feedback (no discussion ability at all, and completely separated from the code)
Blueprints have no alerting mechanism to those who need to comment on it
eg VPC blueprint

How do operators debate a blueprint?
How do operators find a blueprint?
We need a way to find blueprint features and impact

difficult to find justifications

Discussions on features need to be able to be found later, including justifications
The salient points of a discussion on IRC need to be surfaced in the Blueprint (wiki page or whatever) with the IRC log linked to

Action: Ask PTLs to do a better job of pushing back on blueprints, with respect to detail
What detail do we want/need?
* no one-liners
* discussion of design justification - especially the negative part of the decision. We didn't do this because...
* if there is user impact
* level of difficulty to operate
* transition planning - how do I move from this to that?
* Better visibility into how PTLs set blueprint priority
* Not enough people submit bugs, feedback into documentation




https://blueprints.launchpad.net/nova/+spec/aws-vpc-support <-- lacked discussion. Had to contact devs to avoid it getting implemented before discussion

IRC logs are useful for recovering discussion of new blueprints.
    - not all rooms have eavesdrop turned on, e.g. #openstack-nova
    - the IRC and non-IRC discussions should always be distilled into the wiki blueprint
That's sort of the point: http://freenode.net/channel_guidelines.shtml
"If you're considering publishing channel logs, think it through. The freenode network is an interactive environment. Even on public channels, most users don't weigh their comments with the idea that they'll be enshrined in perpetuity. For that reason, few participants publish logs."
Flip side: if BP discussion occurs in unlogged IRC (frequently), then it occurs in a vacuum, and updates to the BP and specification following such discussions aren't made. Then devs wonder why ops don't understand the reason for implementation choices.

Voting on blueprint priority - this is currently based on PTL decisions
For bugs we have "this bug affects me". There is no such feature for blueprints

Feature request: Blueprint Alerts (when a new blueprint is created, spam me)

Talk re "product development in the open" getting more info out up front for ATL summit: https://www.openstack.org/vote-atlanta/Presentation/product-development-in-the-open

potential underlying problems(?) - blueprints are not required for submitting code
- no capability for having any sort of discussion in the blueprint feature on launchpad right now. 
- feedback happens when the patch is submitted, which can be a bad thing and result in hurt feelings :)
- (debated) only reason blueprints exist is for project tracking


it's not just about blueprints. it's about planning

Idea: "Blueprint" as a text document to be reviewed in Gerrit x -- interesting. I kinda like this.  -- yeah me too, like the possibilities

tool repo

bugs, after submission:
    * aren't fixed soon
    * aren't backported
    * aren't packaged by the distros

No operator on every committee, so ops views aren't being represented

arrange operators to go to design summit session to represent ops

Operators don't feel welcome in design summit sessions
* we need communications around the place to get ops and devs in, but "keep marketing people out"
As an operator who is not an ATC, I was explicitly turned away from the "dev lounge" in HK. The dev lounge is a good place to discuss code one-on-one with developers, and that's a shame.

Action: the "dev lounge" needs to be renamed to be inclusive of ops

Design summit sessions start with up to a few months of previous discussion as background. This can be very disorienting for someone who isn't involved in the blueprint.
Asking the participants in design sessions to go back to square one to explain it to ops unfamiliar with the Blueprint can be restrictive to the devs (who are there to make potentially critical decisions) and who have already invested months in the BP.
Perhaps design session summaries should reference the blueprints that will be discussed, so everyone can come up to speed beforehand.

proposed actions:
better blueprint impact / feature summary than what we have with launchpad. What will this proposed blueprint change in code and behavior?
an overall next-release blueprint summary as preparation for the next summit. Summarize all the blueprint changes.
Provide insight into which areas of proposed change are the most important to me.
Get me up to speed on the proposed changes so I can participate in the design sessions on those blueprints
Deployment Testing to support deployment and upgrades to be part of the incubation approval process

The problem is that there is no list of things that need to happen for projects to move from incubation to core
eg "must work on 10,000 hypervisor"

People might be interested in opt-in stats

Underlying/Fundamental problem: the only form of communication is between developers. It's just "code wins".
How do we get operators involved?

Caused by(?): Lack of understanding in the ways of participating


Backports need more attention (https://wiki.openstack.org/wiki/StableBranch)
* operators are not going to switch to trunk (unless they're doing CD ;) ), so some more attention to the 'current' 'stable' version is good
* six months is too long
* backport requests don't get sufficient attention
* other systems offer "top tier" support for the latest N releases [OpenStack does N-1]
* Community supported stable branches beyond N-1

Backports - are they denied, neglected or just not proposed?

Can't add database indexes through the StableRelease system.

https://bugs.launchpad.net/nova/+bug/1112483 <-- actually an N-2 backport

but, have to wait until trunk is fixed before the backport can be done

(critical) security issues get fixed in all branches at the same time

"backport-potential" are backported by devs based on demand, have to indicate you want the backport, or put up a backport patch



#################
# Documentation     #
#################

we need a place that has curated content that tells me the best practice eg to monitor/log things
for example: too many ways to do HA for the database
long term: a reference implementation
answer: it's the ops guide, contribute dammit.

docs lag behind the projects pretty far, perhaps ops guide could keep up

About 1/3 of people have contributed to docs

Steep learning curve on contributing to docs
  * tools and language

docs reviewers are too nit-picky: tabs vs spaces
  * https://wiki.openstack.org/wiki/Documentation/Conventions

are people relying on internal notes, or official docs?

cinder list of config options.

Action: communicate that docs team will take dumps of plain text and put it into a book
Is Docbook keeping ops away from o.org docs, and if so, what blocks docs from moving to Asciidoc, Markdown, RST, etc.?
Action: communicate to ops that you should just paste handy hints in a bug report  

push for actual python API documentation





############
# Misc Notes #
############

- We need to get launchpad answers shut down and moved to ask

Feedback mechanisms
Ask
Product forums still active
Migration to ask in progress
ML
IRC
User groups
Summits
Ambassadors
User committee, surveys
Bugs
Blueprints
Code
Reviews
documentation
etherpad




There are a range of aspects that operators require on a day to day basis for every component.

We, as OpenStack cloud operators, recommend the following items be detailed during the design and development process of major features:

Service or feature monitoring
Does the code have impact on metrics, alerts, health checks or logging?  If yes, please describe the impact as appropriate. 

Lifecycle management
Does the code impact the service or feature's lifecycle?
Does the code have impact on high availability, test, upgrade, roll-back or restarts?

Documentation
What documentation changes need to be considered for this code? Does the object state/flow need documentation? What about usage and debugging info?


Operability Checklist for "the new thing"
    Monitoring
        Metrics
        Alerts
        Health check/http ping
        Logging
    Lifecycle
        HA
        test
        upgrade
        roll-back
        restart
    Documentation
        object state/flow
        usage
        debug

What detail do we want/need?
* history of the thing
* how it was approved
* who approved it
* why it was approved
* not a one-liner
* discussion of design justification - especially the negative part of the decision. We didn't do this because...
* if there is user impact
* level of difficulty to operate
* transition planning - how do I move from this to that?
* Better visibility into how PTLs set blueprint priority
* Not enough people submit bugs, feedback into documentation

Volunteers to continue this work:
Tom Fifield
Reinhardt Quelle
Andrew Mittry
Tim Bell
Matt Van Winkle
Scott Devoid
Hernan Alvarez
Carmine Rimi
Craig Tracey
sri Basam
John Dewey
Justin Shepherd
Paul Reiber


Summit Discussions

aim: share knowledge, share experiences

Action: operators in design summit sessions

Action: the "dev lounge" needs to be renamed to be inclusive of ops, e.g. "DevOps  lounge"

difficult to get to the summit for the whole week? nope

* short session on Monday with PTLs, aimed at getting operators into design sessions
* wrap-up on Friday, with summaries and best practices

operator-proposed sessions, or generic sessions (eg "monitoring", "logging", "best practices", "architecture show and tell", particular tools: chef, puppet)

round-table unconference for operators to just "share" eg pain points and how they solved them

add an "ops" session to each project in the design summit - the first purpose is for developers to ask the operators things :) second for the operators to tell the developers things :)?
Is it worthwhile to have the developers create questions before the summit?

cross-project discussion, at the end of the summit, side by side with the release meeting?

anything that can help identify "marketing" vs "real user" presentations
- one note/new for this summit, all operations track chairs are operators, not vendors ^______________^
    - chatham house rules?

Sharing operational experience and working with PTLs and projects for bugs and new features.

About 10 operator feature topics to cover, see top of etherpad


Representation of Operators in the Community

Will the structure of the user committee change? yes :)

How do volunteers volunteer for the user committee?

