Planning for Fault Tolerance

Fault tolerance, or “high availability,” is critical to any successful business operation. To ensure that requests are processed in the event of failure, FME Server supports configuring fault tolerance throughout the multiple levels of an integrated system. FME Server provides fault tolerance in the following ways:

  1. Recovery: Restarting components and jobs when crashes occur. FME Server provides component and job recovery automatically - no additional planning is needed.
  2. Redundancy: Ensuring there is no single point of failure. As of FME Server 2018 a new architecture ensures that when two FME Server systems are configured together, fault tolerance is achieved automatically.

Fault Tolerance Configurations

FME Server's fault tolerant configuration is an Active-Active installation. As of FME 2018 Active-Passive environments are no longer supported. Active-Active fault tolerance requires a minimum of two systems, with FME Server installed on both systems. Each system uses the same FME Server System Share and the same FME Server System Database.

An additional requirement is a Load Balancer that monitors system health and distributes web requests between the FME Server environments. All the FME Server hosts are active at the same time.

Recovery

Component Recovery

FME Server comes out-of-the-box with component recovery. This means that, even on a single system, FME Server monitors and restarts components that fail, including the FME Engines and the FME Server Core. This is achieved through the FME Server Process Monitor. The ability for FME Server to monitor its own components ensures reliable uptime and dependability.

Job Recovery

FME Server also includes the ability to restart a job when a crash occurs. As a result, jobs that experience temporary issues, such as a network glitch, are re-submitted and run again.

After FME Server submits a translation request to an FME Engine, it monitors the connection to that engine until a response is returned.

FME Server can resubmit a failed job if:

  • The connection to the engine is lost.
  • The engine crashes.

FME Server continues to re-submit a translation up to a specified number of attempts. To prevent FME Server from indefinitely retrying a job that fails, the default setting is to resubmit a failed job up to three times. This setting is configurable and can be turned off entirely.


WARNING
A failed translation request may sometimes cause an FME Engine to shut down improperly. If no maximum limit is imposed, the translation is resent indefinitely, potentially causing indefinite FME Engine failures.

If a job is resubmitted because of a failure, but subsequently succeeds, then the first job log file is overwritten. This hides the cause of the original job failure. This is vary rare, but this possibility is one reason you may wish to set job resubmission to zero.

Re-submitted transactions may also cause data duplication, such as when writing to database formats or when writing mid-translation with the FeatureWriter.

Redundancy

The goal of a fault tolerance environment is to remove single points of failure, so that a single component can fail, without forcing the entire system offline. This is achieved by having FME Server installed on multiple systems, each pointing to the same FME Server System Database and FME Server System Share.

The new fault tolerant architecture, at its simplest implementation, duplicates most of the FME Server components on separate servers. Additional systems are configured similarly and provide the same functionality. A third-party load balancer directs incoming traffic to either of the available systems. There is no stickiness required for the client sessions. Requests are directed to any of the systems.

The following image shows 2 deployment examples: the recommended approach and a fully distributed deployment. By following the recommended approach you gain the benefits of fault tolerance with the minimum number of systems.


Basic Architecture Requirements

  • Load Balancer System
  • FME Server Components (minimum two systems)
  • Fault Tolerant Database
  • Fault Tolerant File System

Benefits

  • Simple to manage
  • Fewer systems required.
  • Can increase number of engines available on each system
  • Easy to add additional systems to increase capacity


Distributed Architecture Requirements

  • Load Balancer System
  • FME Server Web (minimum two systems)
  • FME Server Core (minimum two systems)
  • FME Server Engine (minimum two systems)
  • Fault Tolerant Database
  • Fault Tolerant File System

Benefits

  • Allows for use of own Web Servlet and thus security updates without disrupting other systems
  • Allows engines to be deployed easily with 3rd party Software
  • Finer control for scaling each system's capabilities (memory, CPU, disk space)

TIP
In a fault tolerant installation of FME Server the UDP Publisher/Trigger and SMTP Publisher/Trigger are not supported. To receive e-mail notifications, consider the receive email (IMAP) trigger instead.

Load Balancer System

The customer must provide their own load balancer (LB) and this can be configured to point to FME Server and perform regular health checks (if supported). The LB can also use timeouts to redirect requests to another FME Server system.

FME Server Components

It is recommended to install the FME Server Web Application, the FME Server Core, and FME Server Engines (optional) on a single system and repeat this for a second system (see image above 'RECOMMENDED'). This provides a basic fault tolerant environment. The LB would then be directed to point to these two systems.

Further, similar additional systems can be added to the environment to expand the high availability. Systems with only FME Server Engines can also be registered with the FME Server Cores to increase the engines available, and to distribute the processing across more systems.

The FME Server cores become aware of each other and will handle requests. There will be one Job Manager, and if this fails, the other Job Manager on the other system will take over and handle job requests. There should be minimal downtime when a core goes done. Allow a few moments (1-2 minutes) depending on the LB configuration.

Schedules will continue to operate normally.

Fault Tolerant Database

The customer is in charge of making the Database fault tolerant.

Fault Tolerant File System

The customer is in charge of making the File System fault tolerant.

Tracking Core Failures

A failed system can be investigated while the second active system provides continued operation of FME Server. Once the failed system is recovered and started, it will rejoin the environment seamlessly.

The types of failures that typically cause faults are hardware and operating system crashes, in which the system fails completely.

Log files must be reviewed on the affected system to understand why the FME Server core failed. When the core's availability is affected, the outcome is usually an unusable system.


FME Lizard says...
In the past, clients of Notification Service publishers did not failover, but in 2019.0 this will also occur.

Failover Scenarios

Within a fault tolerant system, there are two main failover scenarios that can occur.

The first is where the engines are installed alongside the core. In the event a server hosting both a core and an engine goes down, the remaining core will resubmit any failed jobs, and continue to send new jobs, to the available engines. The number of engines on the remaining machine will not increase automatically. However, you may - if you wish - manually increase the number of engines on a machine after a failover event, from the Web interface, assuming that the hardware can handle the additional workload.

The second scenario is a core with a distributed engine on a separate machine. When a failover occurs on the core machine, the distributed engine will reconnect to the remaining active core and continue processing.

results matching ""

    No results matching ""