

# Monitoring workflow activity on Conductor Live
<a name="monitoring"></a>

This topic covers the operations you use to monitor channels, MPTS outputs, alerts, and nodes in AWS Elemental Conductor Live.

**Topics**
+ [Monitoring channels](monitoring-channels.md)
+ [Monitoring MPTS outputs](monitoring-mpts-outputs.md)
+ [Monitoring alerts and messages](monitoring-alerts-and-messages.md)
+ [Monitoring nodes](monitoring-conductor-and-worker-nodes.md)
+ [Monitoring the load on worker nodes](monitoring-the-load-on-worker-nodes.md)

# Monitoring channels
<a name="monitoring-channels"></a>

**Topics**
+ [Monitoring the health of channels](#monitoring-for-failure)
+ [Monitoring channel activity at the node](#monitoring-channel-activity-at-the-node)
+ [Viewing channel history](#viewing-channel-history)

## Monitoring the health of channels
<a name="monitoring-for-failure"></a>

You can monitor the status of channels as they run. 

1. In the AWS Elemental Conductor Live main menu, choose **Channels**. Information is color-coded as follows:
   + Yellow background shading indicates that there are active alerts on the channel that you have not yet read and suppressed. 
   + Red background shading indicates that the status of the channel is Error.

1. Display more information if you want:
   + Choose any red icon to go to the **Status – Messages** page. This page shows all messages for this channel. The error message is shaded red and has the same red icon.
   + Choose any orange icon to go to the **Status – Alerts** page. This page shows detailed information about any alerts for this channel.

## Monitoring channel activity at the node
<a name="monitoring-channel-activity-at-the-node"></a>

You can view information about the channel activity that is happening at any worker node. 

1. In the Conductor Live main menu, choose **Channels**. 

1. Select any channel by its ID or name. The **Channel Details** page appears. 

When a channel is running, information appears in three tabs: **Status**, **Parameters**, and **Logs**. 

Elemental Live constantly forwards `_eme` and `_eme_ve` logs to Conductor Live.

Note that channel logs are displayed for 24 hours. Logs that are between 24 hours and one week old are held in zip files that you can unzip if needed.
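
If you need a log from that archive window, you can unzip the bundles yourself. The following sketch extracts every archived bundle into its own subdirectory; the directory layout, file names, and helper name here are hypothetical, so check where your node actually stores its archived logs.

```python
import zipfile
from pathlib import Path

def extract_archived_logs(archive_dir: str, dest_dir: str) -> list[str]:
    """Unzip every archived log bundle found in archive_dir into dest_dir.

    The paths and naming are assumptions for illustration only.
    """
    extracted = []
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Extract each bundle into its own subdirectory, named after
            # the zip file, so bundles don't overwrite each other.
            zf.extractall(dest / archive.stem)
            extracted.extend(zf.namelist())
    return extracted
```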

## Viewing channel history
<a name="viewing-channel-history"></a>

You can view a summarized history of the channel.

On the **Channel Details** page, choose **History**.

Time is shown in the time zone currently configured on the Conductor Live node. The timeline captures when a channel gets created, started, and stopped, and also includes node information. 

# Monitoring MPTS outputs
<a name="monitoring-mpts-outputs"></a>

**Topics**
+ [Monitoring the progress of all MPTSes](#monitoring-the-progress-of-all-mpts-outputs)
+ [Monitoring the muxing performance of an MPTS](#monitoring-the-muxing-performance)
+ [Modifying the MPTS while it is running](#modifying-the-mpts-output-while-it-is-running)

## Monitoring the progress of all MPTSes
<a name="monitoring-the-progress-of-all-mpts-outputs"></a>

You should monitor every MPTS constantly to ensure that none has failed.

1. On the AWS Elemental Conductor Live main menu, choose **Status**, then choose **Overview**. Information about the MPTS appears.

1. If at least one MPTS is in error, go to the **MPTS** page.

1. Look for any MPTS that has an orange icon in the **Status** column. 

1. Choose an orange icon to go to the **Status – Alerts & Messages** page. This page appears with the filter set to show only the information for this MPTS output.

1. Review the alerts and messages to determine why the MPTS failed.

## Monitoring the muxing performance of an MPTS
<a name="monitoring-the-muxing-performance"></a>

To monitor the muxing of an individual MPTS, display the **MPTS** page and choose **Performance** (graph icon) beside the item. The **MPTS Details** page appears with the **Performance** tab on top. 

## Modifying the MPTS while it is running
<a name="modifying-the-mpts-output-while-it-is-running"></a>

You can modify the MPTS output even when it is running:
+ You can modify its properties. 
+ You can add or remove channels. 

See [Modifying an MPTS](modifying-an-mpts.md).

# Monitoring alerts and messages
<a name="monitoring-alerts-and-messages"></a>

AWS Elemental Conductor Live generates alerts and messages to provide information about the status of the nodes in the cluster and about the encoding channels. This section covers how to monitor alerts and messages via the web interface. 

For information about setting up automatic email or web callback alert notifications, and about using the SNMP and REST interfaces for alerts and messages, see the [AWS Elemental Conductor Live Configuration Guide](https://docs.aws.amazon.com/elemental-cl3/latest/configguide/). 

**Topics**
+ [About alerts and messages](#about-alerts-and-messages)
+ [Alerts and messages on the web interface](#alerts-and-messages-on-the-web-interface)

## About alerts and messages
<a name="about-alerts-and-messages"></a>

In the following table, read down the first column to find the type of information that you're interested in. Then read across to find the interfaces that provide alerts about that information and that provide messages about that information.


| Type of information | Alerts | Messages | 
| --- | --- | --- | 
| Access Options |  Web interface, automatic email notification, web callback notification, SNMP trap, SNMP poll, REST calls  |  Web interface, SNMP poll, REST calls  | 
| Information Conveyed |  Alerts are feedback on a problem that must be fixed. For example, the **Channel Error** alert informs you that a channel has moved to an Error state. If you receive automatic email notifications of alerts, they let you know to check for related messages on the web interface.  |  There are three types of messages. **AuditMessage:** informational messages that you do not need to react to; often, these messages are feedback on actions you performed. **WarningMessage:** messages that advise you that there is a risk that a future activity will fail unless you take action to prevent it. **ErrorMessage:** messages that indicate that a planned activity has failed or an unexpected system error has occurred.  | 
| Active/Inactive | Alerts remain active until the underlying problem is resolved. When the cause of the alert is no longer present, the system clears the alert so that it becomes inactive. | Messages are neither active nor inactive. They are defined as “recent” when they are fewer than 24 hours old. | 
| Visibility |  You can toggle the visibility of active alerts. Suppressing an alert this way is similar to marking an email as read. The following section describes where you can see suppressed and unsuppressed alerts on the web interface.  |  Only messages of the type **Error** are visible in the header. You can toggle the visibility of recent error messages, which is similar to marking an email as read. The following section describes where you can see suppressed and unsuppressed messages on the web interface.  | 
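
Beyond the web interface, the table lists REST calls as an access option for both alerts and messages. The following is a minimal polling sketch; the `/alerts` path, the JSON response, and the `active` field are all assumptions, and the actual REST API (including authentication) is documented in the Configuration Guide.

```python
import json
import urllib.request

def active_alerts(alerts: list[dict]) -> list[dict]:
    """Keep only alerts whose underlying cause is still present.

    The "active" field name is an assumption about the response shape.
    """
    return [a for a in alerts if a.get("active")]

def fetch_active_alerts(base_url: str) -> list[dict]:
    """Poll a hypothetical alerts endpoint on the Conductor Live node.

    The endpoint path and JSON content type are assumptions; see the
    Configuration Guide for the real REST interface.
    """
    req = urllib.request.Request(
        base_url.rstrip("/") + "/alerts",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return active_alerts(json.load(resp))
```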

## Alerts and messages on the web interface
<a name="alerts-and-messages-on-the-web-interface"></a>

Conductor Live provides information about alerts and messages in two places:
+ On the header of every page.
+ In more detail on the pages **Status – Alerts** and **Status – Messages**. 

### Web interface header: Alerts
<a name="web-interface-header-alerts"></a>

The web interface header, located at the top of all pages of the web interface, shows a count of alerts that are both active and visible:
+ Active: the condition that is causing the alert still exists.
+ Visible: no user has marked the alert as *read*. 

The count is in a red circle to the right of the information (**i**) icon.

1. Select the red circle to display a pop-up list of the ten most recent active, visible alerts.

1. Optionally, choose the suppress (**x**) icon to dismiss an alert. The alert remains active until the underlying cause is resolved, but it no longer appears in the pop-up list. It is still listed on the **Status – Alerts** page, under the **Active** tab.

   You can unsuppress the alert on the **Status – Alerts** page.

### Web interface header: Messages
<a name="web-interface-header-messages"></a>

The web interface header, located at the top of all pages of the web interface, shows a count of error messages that are both recent and visible:
+ Recent: the message was created in the last 24 hours.
+ Visible: no user has marked the message as *read*. 

The count is in a red circle to the right of the information (**i**) icon.

1. Select the red circle to display a pop-up list of the ten most recent, visible error messages.

1. Optionally, choose the **Suppress** (**x**) icon to dismiss a message. The message remains recent until it is more than 24 hours old, but it no longer appears in the pop-up list. It is still listed on the **Status – Messages** page.

   You can unsuppress the message on the **Status – Messages** page.

### Status – Alerts page
<a name="status-alerts"></a>

On the Conductor Live main menu, choose **Status**. Then choose **Alerts** in the left panel.

The **Alerts** page contains three tabs, for active, inactive, and all alerts.

Each tab shows the same information: 
+ The unique code for the alert
+ The type and the message wording
+ Whether the alert is visible. On the **Active** tab, you can select the visibility icon to change the alert between visible and invisible.
+ The node and association for this alert. The association identifies the target of the alert, for example, a channel.

You can choose the **Alert Filters** button at the top right corner to filter alerts.

### Status - Messages page
<a name="status-messages-screen"></a>

On the Conductor Live main menu, choose **Status**. Then choose **Messages** in the left panel.

The page shows the following information: 
+ The unique code for the message. 
+ The type. Messages are error messages, warning messages, and audit messages. Error messages have red shading and a red triangle icon. Only error messages are included in the message count on the web interface header.
+ The message wording.
+ Whether the message is visible. You can select this icon on an error message, to change the message between visible and invisible.
+ The node and association for this message. The association identifies the target of the message, for example, a channel.

You can choose the **Message Filters** button at the top right corner to filter messages.

# Monitoring nodes
<a name="monitoring-conductor-and-worker-nodes"></a>

You should monitor the nodes regularly to ensure that they are still all online. 

1. On the Conductor Live main menu, choose **Status**, then choose the **Overview** tab. 

   This page shows a summary of the status of the nodes, channels, and MPTSes in the cluster.

1. If a node is shown as failed or offline, you can obtain more information. On the Conductor Live main menu, choose **Cluster**, then choose **Nodes**.

1. To identify the problem node or nodes, look for nodes that have a red or yellow background and an orange icon in the Status column. 

1. Choose an orange icon to go to the **Status** - **Alerts & Messages** page to display detailed information about the alerts and messages. The **Alerts & Messages** page appears with the filter set to show only the information for this node.

1. Review the alerts and messages to determine why the node failed.

1. For detailed information on dealing with problems, see the following topics. 

**Topics**
+ [Offline nodes](#offline-nodes)
+ [Failed worker nodes with worker redundancy](#failed-worker-nodes-with-worker-redundancy)
+ [Failed worker nodes without worker redundancy](#failed-worker-nodes-without-worker-redundancy)
+ [Failed Conductor Live nodes with Conductor Live redundancy](#failed-conductor-nodes-with-conductor-redundancy)
+ [Failed Conductor Live nodes without Conductor Live redundancy](#failed-conductor-nodes-without-conductor-redundancy)

## Offline nodes
<a name="offline-nodes"></a>

Investigate an offline node if you were not expecting nodes to be offline. Try to determine why the node has been taken offline (speak to other engineers and operators) and, if necessary, take steps to bring the node back online. 

## Failed worker nodes with worker redundancy
<a name="failed-worker-nodes-with-worker-redundancy"></a>

When worker redundancy is implemented on the cluster and a node switches to the failed status, any channels that are running on the worker node move to a backup node, as described in [How worker node failover occurs](#how-worker-node-failover-occurs). 

### Setting up for notification
<a name="setting-up-for-notification"></a>

We recommend that you set up Conductor Live so that it sends you an email or a web callback when the following alerts or messages are raised: 
+ 4009
+ 4010
+ 4018

See the [AWS Elemental Conductor Live Configuration Guide](https://docs.aws.amazon.com/elemental-cl3/latest/configguide/) for information on setting up notifications.
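
A web callback notification arrives as an HTTP request to a server you operate. The following minimal receiver reacts only to the three codes above; the JSON payload shape (a `code` field and a `message` field) is an assumption, so check the actual callback format in the Configuration Guide.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# The redundancy-related alert codes recommended for notification above.
FAILOVER_ALERT_CODES = {4009, 4010, 4018}

def is_failover_alert(payload: dict) -> bool:
    """Return True when a callback payload carries one of the codes above.

    The "code" field name is an assumption about the callback format.
    """
    return payload.get("code") in FAILOVER_ALERT_CODES

class CallbackHandler(BaseHTTPRequestHandler):
    """Minimal receiver for web callback notifications."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if is_failover_alert(payload):
            print(f"redundancy alert {payload['code']}: {payload.get('message', '')}")
        self.send_response(200)
        self.end_headers()

# To listen for callbacks on port 8080:
# HTTPServer(("", 8080), CallbackHandler).serve_forever()
```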

### Dealing with a failed node
<a name="immediate-action"></a>

When a node goes to failed, follow this procedure to deal with the failed node and with the redundancy setup. 

1. Go to the **Cluster** - **Redundancy** page and look for the redundancy group that the failed node belongs to: choose each group in the **Redundancy Groups** section and look for the node in the **Active Nodes** tab and the **Backup Nodes** tab. 

   If the node appears in the **Backup Nodes** tab, see [If a reserve node fails](#if-a-reserve-node-fails). Otherwise, continue this procedure. 

1. Verify that there is still at least one node listed in the **Backup Nodes** tab.
   + If yes, then there is no immediate need to deal with the failed node, but you should still deal with it in a timely manner. 
   + If not, you can assume that when the failed node failed over, it used up the last of your backup nodes. You should solve the problem on the failed node as soon as possible and bring it back into service so that you can get back to the state of having at least one backup node. 

     You receive an alert if you have a redundancy group set up but do not have any backup nodes available. 

1. To investigate the failed node (either now or later): 
   + Go to the **Status** - **Nodes** page. The node should have an orange icon in the **Status** column. Choose this icon; the **Status** - **Alerts & Messages** page appears, filtered to show only the information for that node. 
   + Review the alerts and messages to determine why the node failed. 

1. Make sure you have the desired number of backup nodes set up. 

### How worker node failover occurs
<a name="how-worker-node-failover-occurs"></a>

1. Conductor Live determines the action to attempt:
   + If the node was online/idle before it failed, Conductor Live takes no failover action. The node simply goes to the failed status.
   + If the node was online/running, Conductor Live attempts to fail over this node to one of the reserve nodes, as described in the following steps.

1. Conductor Live identifies the redundancy group that the failed node belongs to and selects a reserve node in that group.

1. Conductor Live then attempts to move all channels (in the case of a failed Elemental Live node) or MPTSes (in the case of a failed Elemental Statmux node) to that reserve node and to restart the previously running channels or MPTS outputs on the new node. The role of the reserve node changes from **reserve** to **active**. This node is no longer eligible to be selected as a failover node if another active node fails.

### If a reserve node fails
<a name="if-a-reserve-node-fails"></a>

If a reserve node fails while it is in reserve, it remains a reserve node but its status changes to **offline**. 

If a reserve node switches to **active** and then fails, it will be eligible to fail over to another reserve, in the same way as any other active node is eligible. 

### When a failed node recovers
<a name="when-a-failed-node-recovers"></a>

When a node that is failed is brought back into service, it returns to the status it had when it failed: Active or Backup. 

### Dealing with a false failure
<a name="false-failure"></a>

Conductor Live may determine that a node has failed, when in fact it has only become disconnected from the management network (and is continuing to run channels) but has not shut down. 

Meanwhile, because Conductor Live has determined that a failure has occurred, it attempts to perform a failover. The failover routine does not include any attempt to stop the channels running on the disconnected node. If the failover succeeds, the channels are running on both the disconnected node and the failover node.

However, if the network connection is later re-established (so that Conductor Live can again view activity on the node), Conductor Live attempts to shut down the channels or MPTSes that are running there.

### If a node does not fail over
<a name="if-a-node-does-not-fail-over"></a>

If a node fails but there is no reserve node ready to take over for it, the node remains active/offline. When the node problem is resolved and the node goes back online, it still has its original channels. Channels that were running before the failure start running again.

### Monitoring the distribution of nodes in a redundancy group
<a name="monitoring-the-distribution-of-nodes"></a>

After a failover, you should check the state of the redundancy group and take steps to ensure that the distribution of active versus reserve nodes still matches the desired redundancy type. 

For example, you need to make sure that there is always at least one reserve node in each redundancy group. Each time a node fails, a reserve node switches to **active**. It is possible for all nodes to become active, in which case you need to re-assign at least one node to reserve in order to be prepared for a possible new fail over.
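
The "at least one reserve node per group" check can be expressed as a small helper. The input shape here (group name mapped to a list of node roles) is hypothetical; in practice you would read the roles from the **Redundancy** page or the REST interface.

```python
def groups_without_reserve(groups: dict[str, list[str]]) -> list[str]:
    """Return the redundancy groups that have no reserve node left.

    groups maps a group name to the roles ("active" or "reserve") of
    its nodes; this shape is an assumption for illustration.
    """
    # A group with no "reserve" role has used up its backups and
    # cannot absorb another failover.
    return [name for name, roles in groups.items() if "reserve" not in roles]
```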

On the **Redundancy** page, make sure that the redundancy type shows a non-zero number as the second number (the number of backup nodes).

### Redundancy Status Alert
<a name="redundancy-status-alert"></a>

Alerts are raised if a redundancy group has one or more active, online nodes but has no backup, online nodes. The alert persists until a node is restored to a backup role, or a node without channels is manually moved to a backup role.

For more information about alerts and messages, see [Monitoring alerts and messages](monitoring-alerts-and-messages.md).

## Failed worker nodes without worker redundancy
<a name="failed-worker-nodes-without-worker-redundancy"></a>

When worker redundancy is not implemented on the cluster and a worker node has failed, you must do the following:
+ Determine if failure of the node has caused channels to fail and then take steps to re-start those failed channels.
+ Deal with the problem node.

**To troubleshoot nodes**

1. Go to the **Channels** page and determine if any channels have failed. If they have, then move the channels to other nodes as soon as possible. See [Modifying a channel](modifying-a-channel.md) and change the associated node. 

1. Go to the **Status** - **Nodes** page. The node should have an orange icon in the **Status** column. Choose this icon; the **Status** - **Alerts & Messages** page appears, filtered to show only the information for that node. 

1. Review the alerts and messages to determine why the node failed.

1. Take the necessary steps to resolve the problem and bring the node back into service.

## Failed Conductor Live nodes with Conductor Live redundancy
<a name="failed-conductor-nodes-with-conductor-redundancy"></a>

When you have redundant Conductor Live nodes set up and the primary node fails, the secondary node automatically takes over management of the cluster. This change in role takes a few seconds. 

If you resolve the problem with the failed primary Conductor Live node and bring it back into the cluster, that primary node will take back the leadership role from the secondary Conductor Live node.

## Failed Conductor Live nodes without Conductor Live redundancy
<a name="failed-conductor-nodes-without-conductor-redundancy"></a>

When your cluster has only one Conductor Live node and that node fails, you are not able to use Conductor Live to control the worker nodes. The worker nodes themselves are not affected by the Conductor Live node failure.

**To troubleshoot a Conductor Live node**

1. Go to the **Status** - **Nodes** page. The Conductor Live node should have an orange icon in the **Status** column. Choose this icon; the **Status** - **Alerts & Messages** page appears, filtered to show only the information for that node. 

1. Review the alerts and messages to determine why the node failed.

1. Take the necessary steps to resolve the problem and bring the node back into service.

**Returning a Node from Failure**

When the Conductor Live node comes back online, it automatically takes over management of the cluster again and brings itself up to date on the activity and status of all the nodes:
+ If an alert was active when the node failed and the problem no longer exists, the alert is automatically cleared.
+ If an alert was active and the problem still exists, the alert is not cleared.
+ If a problem occurred on a worker node while the Conductor Live node was offline, Conductor Live now detects this problem and displays a new alert or message.
+ If a problem occurred and was resolved on a worker node while the Conductor Live node was offline, Conductor Live has no knowledge of that problem ever having existed. This is the only information lost during the outage.

# Monitoring the load on worker nodes
<a name="monitoring-the-load-on-worker-nodes"></a>

You can view information about the overall load on any worker node in an AWS Elemental Conductor Live cluster.

On the **Nodes** page, choose the hostname of the node. (Don't choose the IP Address. Doing so will open the web interface for that node in another tab.)

The **Node Details** page appears for that node, showing these charts:
+ Bandwidth
+ CPU Utilization
+ Disk
+ GPU Frames/Second (if the node is running GPU-enabled software)
+ GPU Temperature (if the node is running GPU-enabled software)
+ Memory
+ Realtime
+ Total Frames/Second