Overview of common application problems
 
Data Integrity
High Availability
Self-Monitoring
Software-Only
Low Latency
High Performance
Online Configuration Change
Built-In Testing
Technologies Supported
Add Data Replication to Your Distributed Application
learn more
 
Serious Complexities are Involved with Data Replication
learn more
Product Summary: Echo - Data Replication

Ventura’s Echo Data Replication Product has simple APIs which make it very easy to integrate robust data replication into your existing embedded, distributed application.

Almost every system that uses replicated data must be able to change the sets of machines used for the replicas. The potential for failures and other perturbations is significant during this process especially as the system becomes more distributed because of the increase in messages/data and complexity regarding configuration changes. (see our challenges section.)

The Echo data replication product is used to maintain replicas of application data, including important configuration or management data. It offers a unique combination of capabilities. The key goal of a replication protocol is to ensure that the different copies of data always remain synchronized with each other and do not "diverge" or become corrupted in any way. Echo uses a non-blocking multi-phase commit protocol to ensure data integrity in all cases.

This protocol provides online quorum management that guarantees that at most one partition of the system will think it has a "quorum." Only that part of the system is allowed to continue making changes to the replicated data. The other parts are informed that they do not have a quorum and may not proceed until they can again communicate with the rest of the machines. In addition, Echo-based applications can assign different weights to machines to reflect their importance to the quorum (this is called dynamic linear voting).

Quorum management technology is used so that as data is being replicated, failed machines may be replaced and new ones added to systems in the field. It incorporates a resynchronization protocol that ensures on connection or reconnection to a system, all machines will be brought fully up to date. In addition, if any transactions are in flight when a configuration or other change happens, it ensures that all operations are properly completed in the new configuration. It uses a journaling mechanism to carefully record all information to disk or other persistent storage so that adequate copies of each operation have been replicated prior to completion.

Finally, Echo leverages Ventura’s Pulse communication engine which provides for ease-of-use, security, auto-configuration, scalability, small footprint, portability, standards compliance, and auto-discovery.

 

Echo Feature Breakdown

 

• Data integrity.
Echo uses a multi-phase commit protocol to guarantee that the different copies of data, including essential configuration data, remain exactly the same. This holds when there are machine failures, network outages, network partitions, full site failures -- and even when multiple failure events happen at the same time. Other approaches to data replication are subject to errors in the presence of network partitions, full-site failures etc. (link to Pitfalls Whitepaper.)


• High availability.

Echo guarantees that the replication protocol will continue operating as long as a "quorum" or majority of the machines remain operational and in communication with each other. As opposed to other replication protocols that can "block" after failures even when a quorum is maintained, Echo always continues working in these situations.


• Self-monitoring.
Echo tracks which copies of the data are currently running and keeps the application notified of the state of the system. As important as when the toolkit keeps running is when the toolkit stops itself. It will detect when there are not enough machines currently running and it will pause itself and tell the application that it is not safe to proceed. When a quorum is restored, it will notify the application and start running again.


• Software-only.
Echo uses protocols that operate purely through software and do not require any special hardware. Other approaches require special hardware such as special power supplies to allow machines to turn each other off (the so-called STOMITH or Shoot The Other Machine In The Head approach), programmable network switches that are used to disconnect machines from the network, multiple network connections (sometimes requiring 3 or more network connections), use of extra serial cable connections, or access to twin-tailed or shared access disk drives. These other approaches increase the complexity of configuring the resulting system, require additional hardware in the system, and add additional hardware that can fail. Many of these other approaches also just cannot work in Wide Area Networks settings (you cannot stretch a serial line across the country). The Ventura software-only approach avoids all these problems.


• Low latency.
The protocols used by Echo only require one and a half rounds of communication to make a change to the replicated data. The machines that are part of the quorum only have to write to persistent media (disk drive) once for each operation. This allows the system to operate with low latency speeds. Additionally, the system transfers updates to the data, rather than transferring the entire data, by tracking the changes that have occurred.


• High performance.
Echo’s replication protocol allows any number of changes to the system to be in flight at one time, which allows the operations to be worked on in parallel rather than one at a time. This means that applications that need to make many changes to their replicated state can do so very quickly. This is important in many applications, especially in WAN settings where there may be high network latency and adequate performance requires that operations are processed in parallel.


• Online configuration change.
Echo allows for the set of replicas to change while the application continues to run. This is a very complex protocol because of the potential for failures or other disruptions to occur while the change is being processed, but to the application it is just a simple request via the API. Most applications need to be able to change the set of copies that are being replicated to. For instance, if one of the machines fails, an application needs to be able to replace it. Another example is where additional machines need to be added to improve fault-tolerance or scalability of the application. The library allows machines to be added and removed at any time through the simple APIs


• Built-in testing.
Ventura Network’s primary focus is on providing distributed systems technologies that always work correctly. Because of the extreme complexity of these protocols and the difficulties of knowing when they are right, our replication protocol was designed from the ground up to be testable. This product was written with a full range of unit and sub-system testing, as well as embedded diagnosis and invariant checking software. In addition, the replication protocol supports use of Ventura's Shadow simulated testing harness, which allows the system to be tested thoroughly under billions of simulated failure scenarios and for behavior of the protocols to be cross-checked for correctness. We constantly have clusters of machines running tests against this software; as you read this paragraph these machines have tested thousands more failures.

Technologies Supported:
Networks Fabrics: Ethernet (including 802.11b & g), UDP, TCP; Programming Languages: C; Memory Requirements: 100 – 200 KB Object Size; Operating Systems: Linux, BSD (Microsoft Windows and Windriver's VxWorks to come).