The problem nobody talks about

MetaTrader 5 crashes. Not often, and not always for the same reason, but it crashes. A bad tick, a third-party indicator allocating too much memory, a Windows update that restarts the terminal service, a VPS provider that reboots the machine at 3 AM without warning. For someone running a manual portfolio, a crash means logging back in. For someone running automated strategies in production, a crash can mean unprotected open positions, missed entries, or hours of unmonitored exposure until the next time you check.

The obvious response is to write a watchdog: a second process that monitors the first and restarts it if it goes down. This is basic systems engineering, and it works until it doesn't. The question that exposes the fragility is simple: what watches the watchdog?

A single watchdog process is itself a single point of failure. If the watchdog crashes, hangs, or gets killed by the OS, the monitoring stops silently. The operator has no way of knowing that the safety net disappeared until the next crash happens and nothing recovers it. This is worse than having no watchdog at all, because the operator believes recovery will happen automatically.

This article describes the five-layer architecture we designed to solve this problem. Each layer is independent, runs on a different mechanism, and can recover from the failure of any other layer. The system was built specifically for MetaTrader 5 on Windows, but the architectural principles apply to any critical process that needs to stay alive on a commodity operating system.

Layer 1: The main monitoring loop

The first layer is the core engine. It runs as a foreground process and performs the primary monitoring cycle: checking whether each configured MetaTrader 5 terminal is running, verifying that the process is not frozen, reading heartbeat files written by an EA inside the terminal, and initiating recovery actions if something is wrong.

This loop executes continuously with a configurable interval. Each cycle produces a health score based on the state of all monitored terminals and programs. If a terminal is found to be down, the engine restarts it. If a terminal is alive but the EA heartbeat file is stale beyond a configurable threshold, the engine flags it as unresponsive. Every state change triggers notifications via Telegram and email if the operator has configured them.
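
A minimal sketch of the per-terminal check inside one cycle might look like the following. The function name, the `running` flag, and the 60-second staleness threshold are illustrative assumptions, not the actual engine's API:

```python
# Sketch of the Layer 1 per-terminal check. Helper names and the
# threshold value are hypothetical stand-ins for the real engine.
import time
from pathlib import Path

HEARTBEAT_STALE_SECS = 60  # configurable staleness threshold (assumed value)

def check_terminal(name: str, heartbeat_file: Path, running: bool) -> str:
    """Classify one terminal as 'ok', 'down', or 'unresponsive'."""
    if not running:
        return "down"              # process gone: trigger a restart
    try:
        age = time.time() - heartbeat_file.stat().st_mtime
    except FileNotFoundError:
        return "unresponsive"      # terminal up, but the EA never wrote a heartbeat
    if age > HEARTBEAT_STALE_SECS:
        return "unresponsive"      # heartbeat exists but is stale
    return "ok"
```

The real engine aggregates these per-terminal results into the cycle's health score and dispatches restarts and notifications from there.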

The main engine is the most capable layer. It handles crash detection, process restart, external program monitoring, equity tracking, broker disconnect detection, and report generation. But it is also the most complex layer, and complexity is where bugs live. A sufficiently unusual error could bring this process down. The remaining four layers exist specifically for that scenario.

Layer 2: The redundant heartbeat monitor

Layer 2 runs as a lightweight thread inside the same process as Layer 1 but operates on an independent timer and independent logic. Its only job is to verify that the main monitoring loop is still executing. It does this by checking a timestamp that the main loop writes after each successful cycle. If the timestamp is older than a configurable staleness threshold, Layer 2 concludes that the main loop has hung and takes corrective action.

The corrective action is intentionally limited: Layer 2 does not try to replace the main engine. It writes an error state, triggers a notification if possible, and logs the issue. The assumption is that a hung main loop is a symptom of something deeper that requires a full process restart, which is the job of Layer 3.
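
The mechanism can be sketched as a small daemon thread watching a monotonic timestamp. The class and method names are illustrative, and the `on_hang` callback stands in for the real error-state write and notification:

```python
# Sketch of Layer 2: a daemon thread that checks a timestamp the main
# loop refreshes after each successful cycle. Names are hypothetical.
import threading
import time

class LoopWatchdog:
    def __init__(self, stale_after: float, on_hang):
        self._last_cycle = time.monotonic()
        self._stale_after = stale_after
        self._on_hang = on_hang    # e.g. log + notify; never restarts anything

    def beat(self):
        """Called by the main loop at the end of every successful cycle."""
        self._last_cycle = time.monotonic()

    def is_stale(self) -> bool:
        return time.monotonic() - self._last_cycle > self._stale_after

    def start(self, check_every: float = 5.0):
        def run():
            while True:
                time.sleep(check_every)
                if self.is_stale():
                    self._on_hang()  # flag the hang; a restart is Layer 3's job
        threading.Thread(target=run, daemon=True).start()
```

Using `time.monotonic()` rather than wall-clock time keeps the staleness check immune to system clock adjustments.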

Why not make Layer 2 a separate process? Because having it inside the same process guarantees that it shares the exact same lifecycle. If the process crashes, both Layer 1 and Layer 2 die together, and the responsibility falls to the external layers (3, 4, 5). If the process is alive but the main loop is hung, Layer 2 is alive and can detect the hang. This covers the specific failure mode where the process is still running but the main logic is stuck.

Layer 3: The supervisor process

Layer 3 is a separate, independent process. It has one job: verify that the Layer 1 + Layer 2 process is alive and healthy. It does this by periodically reading a heartbeat file that the main process writes to disk. If the file is missing or stale, Layer 3 restarts the main process.

The supervisor is deliberately minimal. It has no knowledge of MetaTrader 5, no knowledge of terminals, no notification capability. Its only dependency is the filesystem and the ability to start a new process. This minimality is a feature: the less code the supervisor has, the fewer things can go wrong with it.
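
The entire supervisor fits in a few dozen lines. This sketch assumes a heartbeat path, a launch command, and interval values that are illustrative, not the production configuration:

```python
# Sketch of a minimal Layer 3 supervisor. Its only dependencies are
# the filesystem and the ability to start a process. All paths and
# intervals are assumed values.
import subprocess
import time
from pathlib import Path

def heartbeat_is_fresh(path: Path, max_age: float) -> bool:
    try:
        return time.time() - path.stat().st_mtime <= max_age
    except FileNotFoundError:
        return False  # a missing heartbeat file counts as stale

def supervise(heartbeat: Path, start_cmd: list[str],
              max_age: float = 120.0, check_every: float = 30.0):
    while True:
        if not heartbeat_is_fresh(heartbeat, max_age):
            subprocess.Popen(start_cmd)  # (re)launch the main process
            time.sleep(max_age)          # grace period to write a fresh heartbeat
        time.sleep(check_every)
```

The grace period after a restart prevents the supervisor from relaunching the engine repeatedly while it is still initializing.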

The critical design decision here is that the supervisor runs continuously, even while the main process is healthy. It does not wait for the main process to crash before it starts monitoring. It runs from system boot and checks on every interval. This means the supervisor is already in place when the main process starts, not the other way around.

Layer 4: The OS-level watchdog

Layer 4 is not a custom process at all. It is the Windows Task Scheduler, configured to launch the main application at system logon. If the machine reboots unexpectedly (power failure, VPS restart, Windows Update reboot, blue screen), Task Scheduler will automatically start the monitoring system the moment the user session begins.

We evaluated several alternatives before settling on Task Scheduler. NSSM (Non-Sucking Service Manager) can wrap any executable as a Windows service, which gives automatic restart on crash and boot-time startup. However, Windows services run in Session 0, which has no GUI access. MetaTrader 5 is a GUI application that requires a desktop session to function. Running MT5 as a service or from a service creates hard-to-debug interaction problems: the terminal may launch but fail to connect to the broker, display charts incorrectly, or simply hang because it cannot access the desktop.

WinSW (Windows Service Wrapper) has the same Session 0 limitation. We tested both and discarded both permanently. The combination of Task Scheduler with the ONLOGON trigger solves the boot-time recovery requirement without introducing the service session problem.

The Task Scheduler task is configured with specific settings that matter for reliability. It runs only when the user is logged on (which is always the case on a VPS or dedicated trading machine). Its execution time limit is disabled, so Windows never stops the task for running too long. And it starts with a delay of a few seconds after logon to allow the Windows desktop to fully initialize before launching the monitoring stack.
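
For illustration, the `schtasks` invocation for such a task could be assembled as below. The task name and launcher path are made up, and the command is only printed, not executed; note also that the "stop the task if it runs longer than" limit has no `schtasks` flag and must be disabled afterwards in the task's settings or XML:

```python
# Illustrative only: the schtasks command that would create an ONLOGON
# task like the one described above. Task name and path are hypothetical.
task_name = "MT5WatchdogSupervisor"              # hypothetical name
launcher = r"C:\watchdog\start_supervisor.cmd"   # hypothetical path

cmd = [
    "schtasks", "/Create",
    "/TN", task_name,
    "/TR", launcher,
    "/SC", "ONLOGON",     # fire when the user session begins
    "/DELAY", "0000:30",  # wait ~30 s for the desktop to initialize
    "/F",                 # overwrite the task if it already exists
]
print(" ".join(cmd))
```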

Layer 5: The auto-logon guarantee

Layer 5 is not software. It is a configuration: Windows auto-logon. This ensures that after any reboot, the user session starts automatically without waiting for someone to type a password. Combined with Layer 4 (Task Scheduler), this means the full monitoring stack comes back online after any reboot, regardless of whether the operator is physically present or awake.

The auto-logon configuration is stored in the Windows Registry. It can be set manually via netplwiz or programmatically. For VPS environments, most providers configure auto-logon by default. For physical machines, it needs to be set explicitly.
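
The registry values involved are the Winlogon auto-logon keys. This sketch assembles the corresponding `reg add` commands without executing them; the account name is hypothetical and the password is deliberately elided, since storing a plaintext password in the registry is the documented trade-off of this mechanism:

```python
# Illustrative sketch of the Winlogon registry values behind auto-logon.
# Commands are printed, not run. Account name is a hypothetical example.
key = r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"
values = {
    "AutoAdminLogon": "1",            # enable automatic logon
    "DefaultUserName": "trader",      # hypothetical account name
    "DefaultPassword": "<password>",  # elided on purpose
}
for name, data in values.items():
    print(f'reg add "{key}" /v {name} /t REG_SZ /d "{data}" /f')
```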

This layer sounds trivial, but it is the one that operators forget most often. A machine that auto-reboots after a power failure but then sits at the Windows login screen, waiting for human input, defeats every other layer. The entire five-layer architecture is designed so that the only human action required after a full system failure is nothing. The machine comes back, the session logs in, the scheduler fires, the supervisor starts, the engine starts, the terminals restart, and monitoring resumes.

Why five layers instead of two or three

Every layer covers a different failure mode:

  • Layer 1 fails (main loop hangs): Layer 2 detects it.
  • Layer 1 + 2 fail (process crashes): Layer 3 (supervisor) restarts the process.
  • Layer 1 + 2 + 3 fail (all processes die, e.g., system reboot): Layer 4 (Task Scheduler) relaunches everything at logon.
  • Layer 4 can't fire (no user session after reboot): Layer 5 (auto-logon) ensures the session exists.

No single failure, and no combination of two simultaneous failures, can prevent recovery. You would need three independent things to go wrong simultaneously: the main process, the supervisor, and the Task Scheduler all failing at the same time on a machine that also lost its auto-logon configuration. That is a scenario so unlikely that it exits the domain of software engineering and enters the domain of hardware replacement.

The EA heartbeat: closing the visibility gap

There is a monitoring gap that the five layers do not cover by themselves: the internal state of MetaTrader 5. A terminal can be running, consuming CPU, accepting connections, and appearing completely healthy to the operating system while internally being in a state where it is not processing ticks, not executing EAs, or not connected to the broker.

To close this gap, a lightweight Expert Advisor runs inside each monitored terminal. This EA does nothing except write a JSON file to disk every 15 seconds containing the terminal's account state, connection status, equity, open positions, and the status of every EA running on every chart. The monitoring engine reads this file and uses it to determine the terminal's internal health.

If the file stops being written, the terminal is alive but not processing. If the file is written but reports no broker connection, the terminal has a connectivity issue. If the file is written but certain EAs have stopped responding, the operator is alerted to a specific EA problem. This heartbeat mechanism transforms a binary "running or not running" check into a rich assessment of the terminal's operational state.
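
The engine-side interpretation of the heartbeat file can be sketched as follows. The JSON field names (`connected`, `eas`) and the 45-second staleness threshold are assumptions about the format, not the actual schema:

```python
# Sketch of how the engine might interpret the EA heartbeat file.
# Field names and the threshold are assumed, not the real schema.
import json
import time
from pathlib import Path

def assess_terminal(path: Path, stale_after: float = 45.0) -> str:
    try:
        age = time.time() - path.stat().st_mtime
    except FileNotFoundError:
        return "no heartbeat"        # terminal alive but not processing
    if age > stale_after:
        return "stale heartbeat"     # the EA stopped writing
    state = json.loads(path.read_text())
    if not state.get("connected", False):
        return "broker disconnected"
    dead = [name for name, ok in state.get("eas", {}).items() if not ok]
    if dead:
        return "EAs unresponsive: " + ", ".join(dead)
    return "healthy"
```

Each branch maps one file state to one of the operational conditions described above, turning the binary process check into a graded diagnosis.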

What we discarded along the way

Building this system involved testing and discarding several approaches:

Windows services via NSSM and WinSW. As described above, these do not work well with GUI applications like MetaTrader 5. The Session 0 isolation problem is fundamental and cannot be worked around cleanly. Discarded permanently.

Named pipes and shared memory for inter-process communication. We initially considered using named pipes between the main engine and the supervisor. In practice, filesystem-based heartbeat files are simpler, more debuggable, and sufficient for the communication bandwidth required (one status update per cycle). Named pipes add complexity without proportional benefit. Discarded in favor of file-based heartbeats.

Polling intervals shorter than 10 seconds. Aggressive polling generates noise. A terminal that takes 8 seconds to initialize after a restart should not trigger a false alarm because the watchdog checked at second 5. The monitoring interval is calibrated to balance detection speed with false positive avoidance. For most configurations, a 15-second cycle provides detection within 30 seconds of a failure, which is fast enough for any practical trading scenario.

Automatic strategy management. We explicitly chose not to have the watchdog interact with trading logic. It does not pause EAs, close positions, or modify risk parameters. The watchdog manages infrastructure. The operator manages strategy. Mixing these responsibilities would make the watchdog a liability instead of a safety net.

Operational results

After running this architecture in production for several months, we can report the following observations. The five-layer system has successfully recovered from every terminal crash, every VPS reboot, and every Windows Update restart without human intervention. Recovery time from a terminal crash to a fully operational terminal with all EAs running is consistently under 30 seconds, including the time to detect the failure, restart the process, and wait for the terminal to reconnect to the broker.

The supervisor (Layer 3) has been restarted by the scheduler (Layer 4) exactly twice during the testing period, both times due to edge cases in the process lifecycle that have since been fixed. Layer 5 (auto-logon) has been the recovery path once, after a VPS provider rebooted the machine for maintenance at 04:00 local time. The machine came back, logged in, started the scheduler, started the supervisor, started the engine, restarted the terminals, and the operator received a Telegram notification that everything was back online. The operator was asleep and did not know until the morning.

This is the goal. Not that failures never happen, but that failures resolve themselves before they require attention.

Implications for your own infrastructure

If you run automated strategies on MetaTrader 5, or on any other platform that runs as a Windows GUI application, the principles from this architecture apply directly. The specific implementation details will vary, but the questions to ask are the same:

  1. What monitors the terminal?
  2. What monitors the monitor?
  3. What happens if both die?
  4. What happens after a reboot?
  5. Does any step require a human to be present?

If the answer to question 5 is yes, you have a gap. Fill it.