IT Management, Pop Rocks Candy and River Rafting

Friday, June 8, 2012

Mobile SATCOM Management: Bandwidth In the Clouds

The need for speed on the go is often called COTM, or Communications on the Move. Imagine command and control intelligence with high-speed bandwidth for encrypted high-def video, voice and Web browsing. Because capabilities have increased and the economics have improved, mobile SATCOM adoption is taking off.

Viasat looked for standards-based technologies and COTS software to build a highly custom teleport bandwidth management system called SAM (Satellite Access Manager). SAM runs on a 2-to-8 core Linux blade server in high-availability mode, because these deployments typically do not have the luxury of big-footprint hardware. Using a Ka-band or Ku-band satellite uplink with an ultra-small 12-inch tracking antenna, data rates reach 8-10 Mbps. Even higher data rates are possible with larger ground units. As with every network infrastructure, it needs to be managed and controlled. However, in this environment, very little was truly off the shelf.

To build SAM, Viasat employs a development methodology known as Agile Scrum, which is an iterative and incremental approach to software application development. Small tasks are identified, an estimated commitment for the sprint goal is made and then reviewed, and the next tasks are prioritized by customer and internal stakeholder demand. Requirements change and churn unpredictably. The team accepts this because features may not be fully understood or defined up front, so breaking the work into smaller chunks reduces risk and lets them respond quickly to customer- and market-driven deliverables.

Another challenge is not knowing what has already been developed and is in the public domain. Developers tend to build from the ground up, but staying aware of the available technology stacks keeps cost and development time down. Because core COTS software functionality works out of the box, they can focus on features particular to their space and spend time on actual customer requirements.

The management application has some of the traditional FCAPS functionality, but its main job is to monitor and control equipment and services. The topology is an on-the-move, hub-and-spoke architecture with a few thousand devices. For fault management, it remotely monitors alarms on equipment that has gone down or is degrading, either of which equates to loss of service. SAM collects a large number of events and aggregates the data into a consolidated management view. The communication layer between the remote and central servers is SOAP/XML. Operators have a global view and can detect loss of service proactively, before the customer calls come in. Fault correlation reduces very complicated outage scenarios to simple user interface displays. The NOC user is concerned with keeping the lights on, and their job is simply to make sure service is not interrupted. They spend more time on actionable tasks than on troubleshooting, because the troubleshooting is automated through the network management application.
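
To make the aggregation step concrete, here is a minimal sketch of collapsing a stream of raw alarms into one worst-case status per device. It is not SAM's implementation; the severity names, the device names and the functions `consolidate` and `service_at_risk` are all hypothetical.

```python
# Hypothetical severity ranking; a real NMS defines its own levels.
SEVERITY_RANK = {"clear": 0, "minor": 1, "major": 2, "critical": 3}


def consolidate(events):
    """Reduce (device, severity) events to the worst severity seen per device."""
    worst = {}
    for device, severity in events:
        current = worst.get(device, "clear")
        if SEVERITY_RANK[severity] > SEVERITY_RANK[current]:
            worst[device] = severity
        else:
            # Record the device even if it only ever reported "clear".
            worst.setdefault(device, current)
    return worst


def service_at_risk(worst, threshold="major"):
    """List devices whose worst severity is at or above the threshold."""
    limit = SEVERITY_RANK[threshold]
    return sorted(d for d, s in worst.items() if SEVERITY_RANK[s] >= limit)


events = [
    ("modem-1", "minor"),
    ("modem-1", "critical"),
    ("hub-1", "clear"),
    ("modem-2", "major"),
    ("modem-1", "minor"),
]
summary = consolidate(events)
at_risk = service_at_risk(summary)
```

The sketch keeps the worst severity ever seen rather than the latest one, which suits a "what needs attention" view; a live status board would also need to handle alarms clearing.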

The system handles intense signal processing. Because bandwidth is king, there is a ton of performance trending, business analytics and intelligence data collection. SAM takes in quite a bit of data, from Eb/N0 (the ratio of energy per bit to noise power spectral density) compared against bit error rate performance, to dropped CRCs. Managing device configurations and services is key too. They provision and audit multiple remote users on the network within a common bandwidth pool, provide dynamic assignment of bandwidth and prioritize communications.
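
As an illustration of comparing Eb/N0 with bit error rate, the sketch below uses the textbook curve for uncoded BPSK over an additive white Gaussian noise channel, BER = 0.5 * erfc(sqrt(Eb/N0)). The post does not say which modulation or coding SAM's links use, so the curve, the `margin` factor and the function names are assumptions chosen only to show the idea.

```python
import math


def ebn0_db_to_linear(ebn0_db):
    """Convert Eb/N0 from decibels to a linear ratio."""
    return 10 ** (ebn0_db / 10)


def bpsk_ber(ebn0_db):
    """Theoretical bit error rate of uncoded BPSK on an AWGN channel."""
    return 0.5 * math.erfc(math.sqrt(ebn0_db_to_linear(ebn0_db)))


def is_degraded(ebn0_db, measured_ber, margin=10.0):
    """Flag a link whose measured BER exceeds theory by more than `margin` times.

    `margin` is a hypothetical tolerance, not a value from any standard.
    """
    return measured_ber > margin * bpsk_ber(ebn0_db)


expected = bpsk_ber(9.6)           # roughly 1e-5 for uncoded BPSK
alarm = is_degraded(9.6, 1e-3)     # measured BER far worse than theory
```

A link that reports a healthy Eb/N0 but a poor measured BER points at something other than raw signal strength, such as interference or equipment faults.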

With bandwidth increasing and costs coming down, mobile SATCOM is proving to be an attractive solution. Provisioning and managing fault and performance data ensures service availability, just as in any other telecom application.

Thursday, January 12, 2012

Security in Network and Element Management Systems: Genband, Motorola and L-3 Communications Style

Security is getting enormous attention these days and it’s easy to understand why. Selling into carriers and government is big business and architecting and building secure management systems is essential.

In this blog, I’ll discuss a few representative NMS/EMS use cases and cover the three key security layers required in management systems: Authentication, Authorization and Audit (AAA), device-to-application communication, and the sometimes forgotten layer of inter-system communications.

The First Layer of NMS/EMS Security: AAA

NOC security usually assumes a good physical security system is already in place. You would see the guard station with the security cameras and maybe see a biometric entry mechanism, but once you are in the facility, it is all about the management system software.

The first layer of NMS/EMS security is generally around AAA or Authentication, Authorization, and Auditing.

Authentication is accomplished through a challenge-handshake mechanism in which the user's credentials are verified using a three-way handshake. The password is never sent to the authentication module; instead, a one-way hash (called the key) is used. An incrementally changing identifier and a variable challenge value protect against playback attacks. Policies with strong password rules or the use of tokens can also be employed.
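
The handshake described above resembles CHAP as specified in RFC 1994, where the response is an MD5 hash over the identifier, the shared secret and the challenge. The post does not name the hash or the exact message layout, so treat the following as a sketch under that assumption; the function names are hypothetical, and MD5 is shown only because RFC 1994 uses it, not because it is a good choice today.

```python
import hashlib
import hmac
import os


def new_challenge(length=16):
    """Server side: generate an unpredictable challenge value."""
    return os.urandom(length)


def chap_response(identifier, secret, challenge):
    """Client side: hash identifier + secret + challenge (RFC 1994 layout)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def verify_response(identifier, secret, challenge, response):
    """Server side: recompute the hash and compare in constant time."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)


secret = b"shared-secret"          # known to both ends, never transmitted
challenge = new_challenge()        # step 1: server sends challenge
reply = chap_response(1, secret, challenge)          # step 2: client replies
accepted = verify_response(1, secret, challenge, reply)  # step 3: server decides
```

Because the identifier and challenge change on every attempt, a captured reply cannot simply be replayed against a later challenge.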

Once authenticated, the user is granted authorization for access control. Support for user groups provides a mechanism to associate access rights with a set of users collectively. It is not sufficient, however, to tie a user's access rights only to the operation performed. Permissions must also be associated with the subsets of objects the application works with, which is why the authorization policy is designed with fine-grained access control as its focus. With a vast number of operators using the application, it’s essential that each one works within the allowed space.

Considering the complexity of applications, the Access Control Policy should have the flexibility to define access rights of a user to operate on a subset of objects that the applications work with. The security service achieves this by defining a set of authorized views called scopes. These authorized scopes consist of sets of properties associated with the user operations. Thus, managed object properties such as network, IP address, node, type, etc., are used in authorized views to control the access of the users to a specific type of device within a given IP range in a specified network. Database access or device configuration access may be limited to a few operators as well.
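
A scope of this kind can be sketched as a small record that pairs an operation with the object properties it applies to. The `Scope` class, its field names and the `is_allowed` helper below are hypothetical and only illustrate the matching logic; they are not the API of any particular NMS framework.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Scope:
    """One authorized view: an operation limited to matching devices."""
    operation: str
    network: str
    device_type: str
    ip_range: str  # CIDR notation, e.g. "10.1.0.0/16"


def is_allowed(scopes, operation, device):
    """Return True if any scope grants `operation` on `device`."""
    for scope in scopes:
        if (
            scope.operation == operation
            and scope.network == device["network"]
            and scope.device_type == device["type"]
            and ip_address(device["ip"]) in ip_network(scope.ip_range)
        ):
            return True
    return False


operator_scopes = [Scope("configure", "core", "router", "10.1.0.0/16")]
router = {"network": "core", "type": "router", "ip": "10.1.4.7"}
other = {"network": "core", "type": "router", "ip": "10.2.0.1"}
```

Here the operator may configure core routers inside 10.1.0.0/16 but nothing outside that range, and no other operation is implied.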

Auditing is about monitoring what the user does from the moment they sign in, including the time and status of each operation performed. This enables the network administrator to take the necessary steps when any user attempts an unauthorized operation. Beyond security, audit controls are also extremely useful for debugging, because they let you determine which users were on the system before, during and after an incident so you can reconstruct what happened.
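
As a small illustration of using an audit trail for debugging, the sketch below finds which users were active around an incident. The record layout (`user`, `time`, `operation`, `status`) and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta


def users_near_incident(audit_log, incident_time, window=timedelta(minutes=30)):
    """Return the users with audit entries within `window` of the incident."""
    start = incident_time - window
    end = incident_time + window
    return sorted({entry["user"] for entry in audit_log if start <= entry["time"] <= end})


audit_log = [
    {"user": "alice", "time": datetime(2012, 1, 12, 9, 50), "operation": "login", "status": "ok"},
    {"user": "bob", "time": datetime(2012, 1, 12, 10, 10), "operation": "config-push", "status": "ok"},
    {"user": "carol", "time": datetime(2012, 1, 12, 14, 0), "operation": "login", "status": "denied"},
]
incident = datetime(2012, 1, 12, 10, 15)
suspects = users_near_incident(audit_log, incident)
```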

Device-to-Application Communications

The second layer of NMS/EMS protection is securing communications between the management application and devices across various protocols.

Of course, the first thing people think about is the various encryption standards. In the telecom business, the most common is SNMP v3, which supports MD5 or SHA for authentication and DES for encryption. In the government and military space, SNMP v3 is used with AES encryption, as defined and, in some cases, mandated by the National Security Agency (NSA).

Besides the secure protocol layers, the management system has various infrastructure components, each listening on various ports. The management system should be flexible enough to assign non-standard port configurations, be hardened by design and be able to monitor port activity.

The Third Layer: Inter-System Communication and Server Security

In the past, people figured that AAA and securing the device to the app pipe was sufficient protection. But to be truly secure today, inter-system communication is also vital.

The NMS/EMS can be deployed in various environments where it needs to support different data stores depending on the requirements. Different data stores like Relational Database, XML, LDAP, NDS, etc., can be integrated. The security module provides administrative interfaces to configure the data store.

In addition, an NMS/EMS can operate across several IT architectures. For instance, the back-end server, database server and presentation-layer server often run on different hardware. Server-to-client communication and database access need to be secured using SSL or secure RMI (Remote Method Invocation). Remote access is set up via HTTPS.

Then, the physical server environment must be hardened. Several steps are involved, such as: ensuring OS patches are up to date; tuning the OS to stop unwanted services running in the system; removing unwanted user accounts; setting a short timeout value for the root account; setting BIOS passwords; and setting automated notification triggers when any of a list of commands is executed or when system files are modified. Often overlooked, checking the configurations of third-party software components is also key.
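
One common way to detect modified system files is to record a cryptographic hash of each file as a baseline and compare it later. The sketch below assumes that approach; the post does not say how the trigger is implemented, and the function names and file list are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each file path to the SHA-256 hex digest of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(baseline, current):
    """Return paths whose digest differs from, or is missing in, the current snapshot."""
    return sorted(path for path, digest in baseline.items() if current.get(path) != digest)


# Usage outline (paths are examples only):
#   baseline = fingerprint(["/etc/passwd", "/etc/ssh/sshd_config"])
#   ... later ...
#   for path in changed_files(baseline, fingerprint(baseline)):
#       notify_admin(path)   # notify_admin is a placeholder for your alerting hook
```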

Use Cases

Here are three representative cases in telecom, military and mobility apps to show how security is being applied in management systems.

Telecom Systems – GENBAND, an innovator in carrier VoIP systems deployed in Tier 1 carriers such as Verizon, views security as essential and a key differentiator. To harden its management system, GenView, GENBAND typically secures NMS-to-NE communications with protocols such as SSH, SFTP, IPsec and SNMPv3 – depending on customer requirements.

AAA is achieved via a RADIUS-supported central security server with configurable password-reset policies. The central security server can also be integrated into the customer AAA system using standard protocols such as RADIUS and LDAP. Single Sign On (SSO) is provided when launching applications within GenView. Alternatively, the app can be accessed remotely via HTTPS.

Other GENBAND security measures include: pushing performance data via secure FTP, hardening the OS, using restricted ports, conducting periodic vulnerability scans, developing rules to better manage loads, and enforcing rigorous backup/restore procedures to protect data from being corrupted.

Military Grade – An expert in military-grade security, L-3 Communications deploys Type-1 voice/data over IP technologies for governments around the world. L-3’s management system uses many of the same secure infrastructures as commercial carriers, but also supports High Assurance Internet Protocol Encryptor (HAIPE), a National Security Agency-certified technology.

HAIPE is encrypted utilizing Advanced Encryption Standard (AES) algorithms over SNMP v3. The security aspects of L-3’s management platform encompass access control, authentication, data integrity and end-to-end network traffic protection with dual IPv4/IPv6 encryptor capabilities. From a network management point of view, its NMS features real-time equipment fault and performance status, network provisioning and automated policy changes.

L-3 also performs extensive security level checking with an emphasis on device-to-device authentication, audit logging, secure remote software updates, and access control lists. The system is hardened to operate in extreme temperatures ranging from -40 degrees C to 60 degrees C, and it is built to withstand vibration, shock, sand, salt and other harsh environments that you don’t want to be in.

Mobile Intelligence & Public Safety – Motorola Solutions ASTRO and Public Safety LTE are leaders in the world of secure, real-time voice/data networks for emergency response and mobile intelligence. Their systems, which serve local fire/police and state governments, employ end-to-end AES over SNMPv3 for protocol message integrity and authentication. Similar to mil-spec systems, they are rugged and mission-critical and need to be managed via a central console.

These Motorola systems allow operators to troubleshoot faults and remotely configure and optimize system parameters. Since the systems are deployed for government agencies, they comply with FIPS 140-2, a government computer security standard. These days, Motorola Public Safety is moving beyond on-premises systems to managed services and a cloud core offering with guaranteed service levels.

As our brief use cases show, security is a complex problem area that defies easy answers. Carrier and governmental security requirements are high, yet this has been done before and can be economically accomplished with the right tools and procedures.

The Business of Sports and Information Technology

An upset of epic proportions, of off-the-scale magnitude: the incredible happened this past weekend. The Indiana University Hoosiers (unranked, and my alma mater) defeated the NCAA's #1-ranked team, the Kentucky Wildcats, on a last-second 3-pointer at the buzzer. The lead changed hands three times in the closing 2 minutes. Final score: 73-72. The barn burner could not have been scripted better.

To put the significance of this win in context, the Hoosiers have not beaten a ranked team since 2002 and have not won the National Championship since 1987 (under Bobby Knight, with heroes Steve Alford and Keith Smart... and yes, I was at the Fountain in Bloomington, but that is another story). For perspective, there are over 340 Division 1 schools in the nation. It takes a lot of talent and hard work to be #1.

With the Indy Colts in the cellar without Peyton Manning, the State of Indiana needed a shot in the arm, a hair-of-the-dog pick-me-up, or just a plain old-fashioned confidence boost. Even the engineers from Purdue were cheering.

I know what you are thinking. What does this have to do with Business of Sports and Information Technology? Two Things:

1. Just like sports competition, there is competition in business. If you consider the breadth of the ManageEngine product line, we compete with over 100 other software vendors in the market. That's a lot of choice. ManageEngine has over 20 products covering network management, application management, desktop management, Active Directory management, log analysis, traffic analysis... and there is more. Then, if you consider the depth of the ManageEngine product line, it covers operational, security and compliance management. Each category has several distinct players in the marketplace.

2. The sporting industry is big business. It has the same needs as any other business. Sports organizations have networks and servers to run their infrastructure and to serve their employees, their players and, in some cases, their fans. And fans are pretty loyal. The parallel I will draw is that our customers are fans. By last count, we have over 50,000 customers, and the vast majority renew year after year. Price competition will only take you so far. High-functioning technology and high-quality support prove to be the determining factors on the playing field of IT software.

I have compiled a list of sporting organizations that use ManageEngine. From the major leagues, the Sacramento Kings, the Chicago Cubs and the Detroit Tigers, first in the American League Central, use ManageEngine. USA Olympic Volleyball does too. Beyond the USA, the Australian Football League and Chelsea Football Club use ManageEngine. These teams need to play somewhere, and Philly's Comcast Spectacor, home of the National Hockey League’s Philadelphia Flyers and the National Basketball Association’s Philadelphia 76ers, and the Mercedes-Benz Superdome, home of the National Football League’s New Orleans Saints and the Allstate Sugar Bowl, use ManageEngine. These teams have players who need equipment, and Nike Bauer, maker of the best hockey skates in the world (I know, I played hockey for 25 years), and Easton Bell Sports, a leader in protective sports, cycling and motorcycle helmets (which I personally own), both use ManageEngine. And the list would not be complete unless I mentioned that SportingIndex.com, the world's #1 online sports spread betting website, and MapleLeafSports.com, proudly serving collectors of sportscards and autographed memorabilia, use ManageEngine.

Friday, December 9, 2011

Performance based IT Shop Part 3: Architect Level

IT problems usually need to be handled by the network engineer or systems engineer. These people are the craftsmen of the IT trade. However, the solutions, no matter how robust, should be run through the Architect. Every IT department should have at least one person who sees and implements the big picture. This big picture means knowing the overall business goals and the limits of technology, as well as the external governance and regulatory issues.

The CIO sets the vision for the big picture. The Architect executes. In our business, while all the roles are valuable and important, the role of the Architect is critical to ensure that business goals and objectives are met in a highly effective, efficient and cost-sensitive manner. These are the folks creating the next-generation IT infrastructure. The CIO and the Purchasing person are just rubber stamping it… ah, I mean approving it as a Service Desk request. In my previous blogs, I talked about IT shops having a hodge-podge set of tools. The goal of the Architect is to take what they already have and make it work, or blueprint a plan to make it work better.

The Architect can translate tech speak into business speak. They are able to map the Key Performance Indicators (KPIs) to the business services and identify what is important to the IT goals and objectives. They are in alignment with the CIO vision. They tend to ask questions such as:

Where is performance affected the most, and why?

Is computing capacity enough to handle the current and growth situations?

Is there effective support for the end user?

What is impacting SLAs and why?

Does the solution offer synergies and solutions to each business segment or LOB (Line of Business)?

They look to our diagnostic and analytical products like OpManager, Applications Manager, Netflow Analyzer, Eventlog Analyzer, Firewall Analyzer and IT360. Sure, they want to know the day-to-day of what's up and down and the response times, but they look at these metrics in relation to the business goals. They are interested in the trend reporting and the inter-dependencies among IT infrastructure components. They have this sixth sense, this spatial reasoning ability, to provide foresight into the IT organization. The Architect may not have a say in staff hiring, but they are certainly aware of staff utilization: where and why the team is performing... or, more importantly, why it's not. They are giving the CIO ammo to fight the good fight for staffing.

One of these Architects who turned consultant is Sean Freeman, CEA. He admits he drank the ManageEngine Kool-Aid a few years ago... and liked it. He has 20 years of experience as an implementor of IT infrastructure at a government contractor and at mid-size and large enterprises. At our past ManageEngine Users Conference, I caught up with him and miked him up.

http://www.manageengine.com/products/eventlog/testimonials.html

One of his first stories is that he would schedule a meeting with himself every Tuesday morning. He would pore over the ManageEngine reports from the previous week and look for trends, comparing trending against real-time stats. How long has an issue been degrading? How long does it take to close tickets? Human error is still the largest contributor to IT problems. He stressed the importance of having a strong change management mechanism. ManageEngine Device Expert helps automate device configuration changes and also keeps human intervention in check. Then, with Eventlog Analyzer, he was able to find the moment in time that isolated the issue.
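
To show what "trending vs. real-time" can mean in practice, here is a minimal sketch that fits a least-squares line through evenly spaced samples and reports the slope. The function name and the sample data are hypothetical; this is not how any ManageEngine report is computed.

```python
def trend_slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical daily average response times in milliseconds for one week.
response_ms = [120, 122, 125, 131, 138, 150, 161]
slope = trend_slope(response_ms)   # positive slope: the service is degrading
```

Any single reading in that week might look acceptable on a real-time console, while the slope makes the steady degradation visible.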

Weeks later, I interviewed him again. Beyond the proactive alerting and troubleshooting perspective, he said security was a constant concern. Who is hitting us? How much of a target are we? What's the Risk Exposure?

Part of the architecting process was setting up Service Desk services and workflows. He defined how to deal with change requests: who is supposed to be aware of the situation, and who is to approve, test and deploy.

Finally, he said it is a matter of taking control of your environment, being accountable and being able to report back to the business units. Excelling at the operational level empowers the strategic level of IT. Full circle.

Friday, October 7, 2011

Data Mining of LTE Performance Management

You don’t need a crystal ball to know that demand for telecommunications and data-bandwidth requirements is exploding. LTE standards address this huge demand for higher bandwidth, lower latency, and advanced communication services. In turn, it’s more important than ever that Element Management Systems (EMSs) and Network Management Systems (NMSs) properly control network devices to ensure calls go through, video gets viewed, online games perform, and more.

From a service provider’s viewpoint, there’s greater competitive pressure on revenue per subscriber. 3G and 4G networks also force you to greatly reduce operational costs (OPEX), meaning the networking devices themselves must be highly intelligent and require less operational work to keep up and running. An LTE network handles more data and more services. It also means more devices. Operators simply can’t throw more people at these problems. So this presents challenges from both an EMS and NMS perspective.

In LTE networks, many devices need to be managed, increasing the potential points of failure or degradation. These devices include a pool of mobility management entity (MME) devices, a serving gateway (SGW) and an eNodeB cluster, in addition to core and backhaul networks.

Once devices are deployed in the network, they broadcast themselves to the Element Management System. These devices tend to be chatty and send lots of data about their physical condition, health, and performance. The EMS implications of these changes are broad. Certainly Fault Management and Configuration Management are affected, but for this article I’ll focus on the impact on Performance Management.

For performance, mobile providers turn to the 3GPP as their industry standard for the KPIs (Key Performance Indicators) that determine the health of their devices and pinpoint issues that need to be resolved quickly. The data collection mechanics can vary: data may be polled via SNMP, SOAP/XML or SQL, or come from custom data sources such as CSV files. The EMS aggregates the data and visualizes it in a meaningful way. Common KPI examples are call-session management and call-success/failure rates. This data is crunched further for QoS and Service Assurance purposes and sent northbound to OSS/BSS systems.
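
As a rough illustration of turning collected counters into a KPI, the sketch below reads per-node call counters from CSV text and computes a success rate. The column names (`node`, `attempts`, `successes`) and function names are hypothetical, and the formula is a plain ratio rather than a specific 3GPP-defined KPI.

```python
import csv
import io


def call_success_rate(attempts, successes):
    """Percentage of successful calls, or None when there were no attempts."""
    if attempts == 0:
        return None
    return 100.0 * successes / attempts


def kpis_from_csv(text):
    """Compute the call success rate per node from CSV counter data."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["node"]: call_success_rate(int(row["attempts"]), int(row["successes"]))
        for row in reader
    }


sample = "node,attempts,successes\nenb-1,200,198\nenb-2,0,0\n"
kpis = kpis_from_csv(sample)
```

Returning None for an idle node keeps "no traffic" distinct from "0% success", which matters once the values are aggregated or sent northbound.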

In one case, Viasat has built a next-gen LTE satellite system. Viasat provides a Ground-Based Beam-Forming (GBBF) system comprising the CMS (Control and Management System), UBS (Uplink Beacon Station) and Gateway for Boeing's mobile satellite communications, which beams the signals to multiple ground gateways. There are not many ground gateways, but each one generates a massive amount of data as it handles the analog-to-digital conversion, signal processing and LTE performance KPIs while constantly maintaining its position relative to the satellite beam.

In short, LTE presents unprecedented challenges. This enormous increase in bandwidth traffic means more infrastructure devices and media servers. The amount of health and performance data is immense. This distributed data is mined and analyzed by the EMS to drive the Operators’ business goals … which is making sure the call goes through, the data arrives and the customer is satisfied.

Thursday, August 18, 2011

Performance based IT Shop Part 2

Not all IT problems come under the domain of the network engineer. In my previous blog, I talked about IT shops having a hodge-podge set of tools. There are various reasons for this, but the real inefficiency arises when these tools perform the same functions. There comes a time when you need to look at IT problems from different perspectives. A few examples follow:

Kenn Nied, Senior Network Engineer at WA State Board for Community and Technical Colleges, illustrates this. While looking at OpManager from a networking point of view, the operator sees alerts that a few switches and a firewall are unresponsive. Is it faulty equipment or an attack? Then, turning to a security mindset, he looks at ManageEngine Device Expert to see real-time and historical configuration changes. In one case, he identified a firewall rule change and realized that this misconfiguration had made the switches unresponsive. Diagnostic time was minimal.

Albert E. Whale, CHS CISA CISSP, Senior Technology & Security Director for ABS Computer Technology, Inc., explains the security aspect further. When you are managing the security of a business, several essential tools are needed to manage the environment. You need to get a better handle on the design, information flow and stability of the environment. First is a baseline review of all of the network devices. ManageEngine Device Expert captures the current configuration of the network switches and firewalls. It's an invaluable tool for managing change control on configurations, and also for evaluating all of the configurations at a glance. Continuing from the baseline report, both ManageEngine EventLog Analyzer and Firewall Analyzer identify bottlenecks in network throughput and attack information within the enterprise. Being proactive on security allows for protection before break-ins occur.

Bill Duffy, CTO of Northwind Technology, describes the compliance angle. IT departments faced with compliance oversight, whether from internal audit and risk management or from external regulatory bodies overseeing a particular industry, share common goals in meeting these requirements:

* Ability to incorporate aims of compliance reporting into overall monitoring and system administration strategy to optimize technology investments as requirements change and grow.
* Need to reduce the time spent on compliance and audit reporting.
* Use monitoring toolset to proactively manage risk across the organization.
* Demonstrate adherence to compliance controls with clear, objective and easily accessible evidence.

Central to achieving these aims is finding a comprehensive suite of tools that covers all areas of IT security and infrastructure and provides easy access to administrators and auditors. Moreover, it is paramount to provide a rich reporting framework to address ad-hoc and historical data requests as part of evidence gathering during audits. IT departments meeting compliance need to show service availability, IT administration staff activity tracking, change management, asset management, access control, as well as audit trails and logging (security, system, applications, maintenance etc).

The ManageEngine suite of products is unique in being able to effectively bridge the IT landscape to meet these compliance demands. By utilizing ManageEngine ServiceDesk Plus, OpManager and AD Manager Plus as well as modules for AD Audit Plus and Asset Explorer in an integrated fashion, we are able to provide a complete compliance approach streamlined to limit audit and administration burdens on human and system resources while delivering a risk management solution and satisfying audit controls.

Wednesday, August 3, 2011

Performance based IT Shop

Some companies make insightful IT business decisions because they have the right data, processes and software. Because ManageEngine fits into the software bucket, I’ll address this straight up. It never fails to amaze me how many IT departments have a hodge-podge set of tools. Recently, I ran into a company that was using What’s up, Altiris, some basic MRTG and Tivoli… and it was only giving them up/down status. They also purchased Applications Manager and were liking the results.

I’ve heard this silo, multi-tool story many times. It happens for a variety of reasons. It is usually based on the IT infrastructure maturity level, the evolution of their needs at the time or IT decision makers coming and going. The attitude of the moment is: “Got a problem, I’ll solve it!” Reminds me of that line in the Vanilla Ice song...

Management software is not cheap and can cause neck and back problems (swiveling your head back and forth to look at multiple consoles). There came a point where they wanted to become a performance-based IT shop. They learned more about Applications Manager, added VMware and storage monitoring, then added Service Desk to aid in their trouble ticketing and their incident and change management processes.

There comes a time when intuition-based troubleshooting does not scale. Data needs to be collected to get a sense of what’s going on in the IT infrastructure. Then one must identify strengths and weaknesses and measure progress against goals and historical data, all of which supports good decision making.

Selecting Metrics to Predict Performance

IT metrics should be defined to fit the individual need. Not all infrastructures are the same. Within ManageEngine products, one can collect hundreds of arcane metrics. In some cases, IT shops are firefighting all day and no one is aware of the performance metrics. Managing and controlling the IT metrics has big implications. Downtime and loss of productivity definitely put a hit on the financial bottom line. Selecting just a few critical metrics is key to moving toward a performance-driven organization. Revisit the metrics continually to keep them aligned with the decision-making strategy. Then make the metrics visible for all to see. Some of our customers put up a dashboard in high-traffic areas. People became more aware and active in understanding the goals of the IT strategy, thus making everyone more accountable.

Below are customer examples to drive the point home.

Jamie Gilbert, Director, CIO of CD Baby, the largest online distributor of independent music, is using ManageEngine Applications Manager and Service Desk. He said there is an expectation within the organization of no downtime. Uptime metrics and SLA reporting for long-term trending of site performance, using URL sequence testing, are invaluable. Beyond performance, he also uses it for troubleshooting analysis.

In a previous position, he implemented Applications Manager to monitor 450 real and virtual servers in a mixed Windows and Linux environment with MS-SQL and Oracle databases. He experienced issues with a new application running on Apache, Tomcat and Java. Using real-time performance reporting in Applications Manager along with long-term trending and comparative analysis reporting across servers, the team was able to home in on the root cause of the issue. The root cause ended up being an application programming issue in conjunction with Tomcat connection limitations and JVM memory allocation. It was a multifaceted problem, and Applications Manager made it easy to see the problems and allowed the team to come up with a path to resolution.

Darren Qualls, CTO of Premier Global Technologies and a user of ManageEngine IT360, explains database performance this way. Slow-performing databases can be extremely tricky to chase down. For example, take 9 terabytes of SQL Server data and throw a $20k piece of hardware at it. The likelihood is you’re still going to have problems. There are a few common issues you will run into with database servers. In most cases, you will want to start with lock waits. This is one of the standard metrics for any product. There are so many ways you can mess up record locking and not even know it for a year or two.

In 90% of the cases, record lock issues are only a drop in the bucket. The next thing I run into is the disks. Slow disk access will bring a half-million-dollar blade system to its knees EVERY time! There are so many things that I have to categorically rate as self-imposed: incorrect normalization of data, bugs in code, incorrect commit placement or parameters, etc. Underpowered hardware, whether from incorrect initial specs, organic growth or expired systems, will also cause problems. Another is telecom issues, which can be anything external to the system, such as network setup or remote pulls on queries for reports.

These are your three common server setups, and you’ll need network maps and traffic monitoring to isolate the data and determine the issue. Do not skimp; without them, you may end up spending about three times the effort to resolve it.