Checkpointing Overview

Reckoning with inconsistent and unpredictable power is one of the major challenges of intermittent systems. Conventional programming models and hardware architectures largely take a consistent power supply as a given, so programmers and designers of intermittent systems must often devise and implement a way to mitigate power failure as part of their overall design.

Challenges

The major software challenges of designing an intermittent device include:

Ensuring proper (re)initialization of system state and peripherals after power loss: even a device using a matched operation approach may encounter extended periods of power loss (e.g. a device harvesting solar energy at night), so an intermittent device should be able to properly resume operation depending on its previous state and the length of power loss.
Ensuring forward application progress: some loss of progress is generally unavoidable on a device with unpredictable power. However, a device should be able to ensure that some progress is made on each power cycle, or at least have the ability to recognize when it is stuck and react accordingly.
Data consistency: most intermittent devices maintain both volatile and non-volatile memory stores, and most checkpointing strategies rely on storing at least some information in non-volatile memory. Intermittent devices need to ensure both that writes to non-volatile memory complete successfully before power loss and that possible inconsistencies between non-volatile memory and program state are accounted for (such as write-after-read errors, where re-executed code reads a value it has already overwritten).
Limited power budget: a capacitor holds only a fraction of the energy of a similar-sized battery. Capacitor sizing and the method of energy harvesting can further reduce the amount of energy a device has available, with some small intermittent devices having access to only a few microwatts of power at any given point. As a result, the method(s) chosen should have as small a footprint as feasible in order to ensure that most energy is used for useful work.
Adaptability: an intermittent device should be able to adjust its operation to account for varying environments and energy availability, especially if it will be deployed across a variety of environments. Even for more narrow deployments, an inflexible device may fail if the testing environment fails to accurately model its real world application.
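The write-after-read hazard mentioned under data consistency can be illustrated with a small host-side simulation. This is only a sketch: the `nvm` dictionary stands in for non-volatile memory, and `PowerFailure`, `increment_unsafe`, `increment_safe` and `run_region` are hypothetical names chosen for illustration, not part of any real toolchain.

```python
class PowerFailure(Exception):
    """Simulated loss of power part-way through a code region."""


def increment_unsafe(nvm, fail):
    # Read a non-volatile location, then write the same location (WAR).
    value = nvm["count"]
    nvm["count"] = value + 1
    if fail:
        raise PowerFailure()  # power is lost before the region finishes


def increment_safe(nvm, fail):
    # Write the result to the spare slot; the committed slot stays untouched,
    # so a re-execution reads the same input as the first attempt did.
    active = nvm["active"]
    spare = 1 - active
    nvm["slots"][spare] = nvm["slots"][active] + 1
    if fail:
        raise PowerFailure()
    # Single commit write, assumed to coincide with the end of the region.
    nvm["active"] = spare


def run_region(step, nvm):
    """Run `step`, failing once, then re-execute it from its start."""
    fail = True
    while True:
        try:
            step(nvm, fail)
            return
        except PowerFailure:
            fail = False  # power restored; the region restarts


unsafe_nvm = {"count": 0}
run_region(increment_unsafe, unsafe_nvm)   # incremented twice: count ends at 2

safe_nvm = {"slots": [0, 0], "active": 0}
run_region(increment_safe, safe_nvm)       # committed value ends at 1
```

The unsafe version applies the increment twice because the second attempt reads the value the first attempt already overwrote. The safe version keeps the committed copy intact until one final write switches slots.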

A general survey of the two main approaches to tackling these challenges is given below. This list is not intended to be exhaustive, as checkpointing/state management in intermittent computing is a significant and evolving area of research: rather, the objective is to provide a general overview of the most common methods currently available, along with their benefits and tradeoffs. These approaches also assume a standard (von Neumann) device architecture: while other hardware configurations are being explored in intermittent computing, low-power devices with traditional architectures such as the MSP430 are cheaper and more accessible, and so currently make up the vast majority of intermittent devices being designed and tested.

Matched/Energy-Neutral Operation

The first approach is to try to match the energy expenditure of the device to the amount of energy currently being harvested: for example, an environmental sensor might adjust how often it samples the surrounding environment to maintain or stay above a given energy threshold. This provides several benefits: a designer can (to an extent) assume steady power as with a traditional battery-powered device, while avoiding much of the overhead that comes from checkpointing and the potential issues that arise from constant power failures and restarts. This also makes matched operation a simpler approach when converting existing battery devices to batteryless operation, as it is usually easier to adjust a device’s power level than to rewrite code that may be prohibitively difficult to access or change.

While simple in theory, there are a number of practical challenges. The most obvious is the underlying assumption of sufficient, relatively constant harvested energy: while it may be possible for a device to sleep through periods of scarcity, if a device is unable to meet its target thresholds it may achieve no forward progress at all.

Less obvious is managing the balancing act of energy expended versus energy harvested. Even relatively steady energy sources (such as natural sunlight) may have fluctuations throughout the day, requiring the device to recognize these changes and adjust its consumption accordingly. Energy consumption by a given program or operation can also be difficult to predict, particularly on code deployed across heterogeneous devices and peripherals: this requires significant testing and evaluation to determine energy consumption, or the energy management logic to be able to determine it dynamically.
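As a rough illustration of the balancing act, the sketch below picks a sampling period so that average consumption does not exceed harvested power. The function name, its parameters and the numbers in the usage lines are assumptions made for this example; a real device would also need to measure or estimate the harvest rate and per-sample energy, which is the hard part described above.

```python
def sampling_period_s(harvest_w, sleep_w, sample_energy_j, min_period_s=1.0):
    """Shortest sampling period (seconds) that keeps average power draw at or
    below the harvested power, or None if even sleeping is unsustainable.

    Average draw over one period T is sleep_w + sample_energy_j / T, so the
    energy-neutral condition is T >= sample_energy_j / (harvest_w - sleep_w).
    """
    surplus_w = harvest_w - sleep_w
    if surplus_w <= 0:
        return None  # harvested power does not even cover the sleep current
    return max(min_period_s, sample_energy_j / surplus_w)


# Illustrative values: 100 uW harvested, 20 uW asleep, 400 uJ per sample.
period = sampling_period_s(100e-6, 20e-6, 400e-6)

# When harvesting drops below the sleep draw, no period is sustainable.
starved = sampling_period_s(10e-6, 20e-6, 400e-6)
```

With the illustrative values the device could sample about once every five seconds; re-running the calculation as the harvest estimate changes is one simple way to track fluctuations during the day.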

Examples

AsTAR
Flute

Checkpointing

The second approach accepts power loss as inevitable and focuses on recovering from it, rather than trying to forestall it. Under checkpointing a device waits until it has accumulated sufficient energy before it activates, then performs a task until it either runs out of energy or completes the task in question, repeating the process as it harvests and expends energy. Before power loss the device should save (checkpoint) its work in non-volatile memory, allowing it to continue once power is restored.

This method brings its own challenges. First, a specific checkpointing strategy must be chosen: as writing to non-volatile memory is relatively expensive from an energy perspective, an ideal implementation will work to minimize checkpoint size and frequency. Ensuring forward progress is also more difficult, and often goes hand in hand with capacitor selection: depending on the methods and capacitors chosen and the ambient energy, a device may fall victim to a loop where it lacks the energy to advance to the next checkpoint, constantly restarting and failing at the same step of execution. Frequent outages also impact time tracking, with exact measurements after extended power loss being difficult if not impossible: in applications where timeliness is a concern, a checkpointing strategy will need to determine if/when an ongoing process or piece of data is no longer timely/fresh and react accordingly.
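One simple way to recognize the restart loop described above is to count consecutive boots that resume from the same checkpoint. The sketch below assumes a dictionary `nvm` standing in for non-volatile memory and hypothetical key names (`last_checkpoint`, `boot_checkpoint`, `retries`); what the device does once it detects the condition is left to the application.

```python
def on_boot(nvm, max_retries=3):
    """Return True if the device appears stuck at the same checkpoint.

    nvm["last_checkpoint"] is assumed to be updated by the checkpointing
    code whenever a new checkpoint is committed.
    """
    if nvm.get("boot_checkpoint") == nvm["last_checkpoint"]:
        # Same resume point as the previous boot: no progress was committed.
        nvm["retries"] = nvm.get("retries", 0) + 1
    else:
        # Progress was made since the last boot; reset the counter.
        nvm["boot_checkpoint"] = nvm["last_checkpoint"]
        nvm["retries"] = 0
    return nvm["retries"] >= max_retries


nvm = {"last_checkpoint": 4}
history = [on_boot(nvm) for _ in range(4)]  # four boots with no new checkpoint
```

A device that sees `True` here might wait for a higher charge level before retrying, skip the offending step, or fall back to a cheaper code path.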

With these challenges in mind, why select checkpointing? The simplest answer may be that the device’s intended use requires it: if the energy available during deployment cannot be guaranteed with any consistency, then the device’s operation will need to account for that infrequency and downtime. Another reason is device and capacitor size: under checkpointing a capacitor need only be as large as necessary to complete the most energy intensive task between two checkpoints, allowing for selection of a smaller capacitor and its attendant advantages (faster charge, lower power loss). The size requirements of smaller devices may also preclude larger capacitors, requiring a device to operate on a much more limited power budget at any given time.
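The sizing rule above can be made concrete with the standard capacitor energy relation, where the usable energy between a turn-on voltage and a brown-out voltage is E = ½·C·(V_on² − V_off²). The sketch below is a back-of-the-envelope estimate only: the function name, the `margin` parameter and the example voltages and task energy are assumptions, and leakage and regulator losses are ignored.

```python
def min_capacitance_f(task_energy_j, v_on, v_off, margin=1.0):
    """Smallest capacitance (farads) whose usable energy between v_on and
    v_off covers the most energy-intensive stretch between two checkpoints.

    Rearranges E = 0.5 * C * (v_on**2 - v_off**2) for C.
    """
    if v_on <= v_off:
        raise ValueError("turn-on voltage must exceed brown-out voltage")
    return 2.0 * task_energy_j * margin / (v_on ** 2 - v_off ** 2)


# Illustrative values: a 50 uJ task, turn-on at 3.0 V, brown-out at 1.8 V.
c_min = min_capacitance_f(50e-6, v_on=3.0, v_off=1.8)
```

With these illustrative numbers the result is roughly 17 µF; in practice a margin above 1.0 would be chosen to cover the checkpoint write itself and any unmodeled losses.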

Within checkpointing, two primary subapproaches exist:

Standard Checkpointing

Standard checkpointing saves the current program state at periodic intervals, ensuring the program can be resumed from the last checkpoint upon power restoration. These checkpoints must ensure that all necessary intermediate data is saved, including variables that can potentially lead to data inconsistency if not maintained (e.g. loop count variables), as well as the location of the checkpoint within the code execution itself.
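A minimal host-side sketch of this idea follows, assuming a dictionary `nvm` as a stand-in for non-volatile memory and a two-slot (double-buffered) layout so that an interrupted checkpoint never corrupts the last good one. All names are hypothetical; the saved state holds both the loop counter and the running total, since losing either would make the resumed result inconsistent.

```python
import copy


class PowerFailure(Exception):
    """Simulated loss of power; all volatile state is discarded."""


def save_checkpoint(nvm, state):
    spare = 1 - nvm["active"]
    nvm["slots"][spare] = copy.deepcopy(state)  # old slot stays valid meanwhile
    nvm["active"] = spare                       # single commit write


def restore_checkpoint(nvm):
    return copy.deepcopy(nvm["slots"][nvm["active"]])


def sum_range(nvm, n, checkpoint_every, fail_at=None):
    """Sum 0..n-1, checkpointing every `checkpoint_every` iterations."""
    state = restore_checkpoint(nvm)  # {"i": loop index, "total": running sum}
    while state["i"] < n:
        if state["i"] == fail_at:
            raise PowerFailure()
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(nvm, state)
    save_checkpoint(nvm, state)
    return state["total"]


nvm = {"active": 0, "slots": [{"i": 0, "total": 0}, None]}
try:
    sum_range(nvm, 10, checkpoint_every=4, fail_at=6)  # dies at iteration 6
except PowerFailure:
    pass
result = sum_range(nvm, 10, checkpoint_every=4)        # resumes from i == 4
```

The second call repeats iterations 4 and 5, which were lost with the volatile state, and still arrives at the same total as an uninterrupted run.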

A checkpoint can theoretically be placed at any point in the execution: in practice atomicity and timeliness constraints will have an impact on where and how checkpoints are made. For example, there should be no checkpoints between code that checks if a peripheral is active and reading input from the peripheral, as there is no guarantee that a peripheral will be active immediately after a checkpoint is restored. Likewise, there may be little to no value in processing input data from a sensor if that data is no longer fresh or relevant when the device resumes.

Checkpoint placement must also strike a balance on frequency: every checkpoint other than the last one written before a power failure is technically wasted, as it results in expensive writes to non-volatile memory that turn out to be unneeded. Some options for improving checkpoint efficiency:

Checkpoint only at certain energy thresholds, either skipping checkpoints when energy is plentiful or forcing an automatic checkpoint once available energy falls below a certain level
Minimize the amount of data saved at each checkpoint, such as saving only variables that have changed between the current and previous checkpoint
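Both options can be sketched in a few lines. The threshold values, function names and the idea of reading a single supply voltage are assumptions for illustration; a real implementation would depend on the device's voltage monitoring hardware and memory layout.

```python
def should_checkpoint(voltage, low_v, high_v, at_checkpoint_site):
    """Decide whether to write a checkpoint at the current point in execution."""
    if voltage <= low_v:
        return True    # energy nearly exhausted: force a checkpoint now
    if voltage >= high_v:
        return False   # energy plentiful: skip this checkpoint
    return at_checkpoint_site  # otherwise honour the placed checkpoint sites


def changed_vars(previous, current):
    """Return only the variables that differ from the previous checkpoint."""
    return {
        name: value
        for name, value in current.items()
        if name not in previous or previous[name] != value
    }


delta = changed_vars(
    {"i": 1, "total": 5, "mode": "sense"},
    {"i": 2, "total": 5, "mode": "sense"},
)
```

Here only `i` would be written at the second checkpoint. The trade-off is that restoring now requires applying the delta on top of an earlier full checkpoint.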

Examples

Task-based Checkpointing

Task-based checkpointing approaches state management by breaking the device’s code into smaller individual tasks. Under this approach, each task is a separate atomic unit where progress is only saved once the task is complete and the output calculated: if a task is interrupted by a power failure, the device simply restarts the task from the beginning. Compared to standard checkpointing this reduces the raw amount of data that needs to be saved at any individual checkpoint: the device need only concern itself with the inputs and outputs of a given task, rather than having to save any intermediate variables or state to non-volatile memory.
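The restart-on-failure behavior can be sketched as a small host-side simulation. Everything here is hypothetical: `nvm` stands in for non-volatile memory, tasks are plain functions from input data to output data, and the commit of outputs plus task index is modeled as a single atomic step, which a real runtime would have to implement carefully.

```python
class PowerFailure(Exception):
    """Simulated loss of power in the middle of a task."""


def run_tasks(nvm, tasks, fail_once_in=None):
    """Run `tasks` in order, resuming from nvm["next_task"]."""
    pending_failure = fail_once_in
    while nvm["next_task"] < len(tasks):
        index = nvm["next_task"]
        try:
            if index == pending_failure:
                pending_failure = None
                raise PowerFailure()
            # The task works on a volatile copy of its inputs.
            outputs = tasks[index](dict(nvm["data"]))
        except PowerFailure:
            continue  # power restored: restart the same task from its start
        # Commit outputs and advance (modeled here as one atomic step).
        nvm["data"] = {**nvm["data"], **outputs}
        nvm["next_task"] = index + 1
    return nvm["data"]


def sample(data):
    return {"raw": [3, 5, 7]}  # placeholder sensor readings


def average(data):
    return {"mean": sum(data["raw"]) / len(data["raw"])}


nvm = {"next_task": 0, "data": {}}
final = run_tasks(nvm, [sample, average], fail_once_in=1)
```

The `average` task is interrupted once and rerun from scratch; because it only reads committed inputs and its outputs are committed at the end, the rerun produces the same result.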

The primary challenge lies in dividing the program into individual tasks: any given task must be able to complete in a single charge, so individual tasks should be as short as feasible. Breaking a program into multiple tasks and the resulting routing can potentially be complex, especially with machine learning/DNN models, and may require extensive (if not ground-up) rewrites of existing code.

Examples

Alpaca
Mayfly
Artemis
Chain
InK

Checkpointing and Hardware Considerations

More so than in traditional devices, checkpointing and hardware design go hand in hand, with decisions made in one area affecting design considerations in the other. So far we have assumed the bare minimum for an intermittent device: an energy harvester, a single capacitor, and a microcontroller. In practice, most intermittent devices will utilize peripherals to sense, interact and communicate with the world around them. More elaborate capacitor/power storage setups can also complicate checkpointing logic.

The Problems With Peripherals

Most checkpointing implementations concern themselves primarily with the microcontroller: the vast variety of peripherals available makes extending the implementation to cover all potential device configurations impractical for most researchers and developers. Additionally, most peripherals are not designed with intermittent operation (much less checkpointing) in mind, and often operate asynchronously from the microcontroller. These two factors throw a wrench into the (relatively) ideal assumptions made so far.

Consider a simple temperature sensor. On power restoration the checkpointing implementation will need to determine if the sensor should be enabled at all, and if so take the proper steps to initialize it (some devices may demonstrate peculiar behavior on power failure if specific steps are not taken). Once done, the device must retrieve measurement(s) from the sensor and properly store the data for additional processing: this must also be coordinated with whatever tasks the microcontroller is currently performing. Peripheral power draw can further complicate energy monitoring and management, as a checkpointing strategy based on just-in-time checkpoints may fail if active peripheral energy consumption is not properly factored into the calculations.
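The restore steps for such a sensor might look like the following sketch. `TemperatureSensor` is a hypothetical stand-in for a driver (a real one would talk to hardware and have its own initialization sequence), and the `nvm` keys are assumptions chosen for this example.

```python
class TemperatureSensor:
    """Hypothetical sensor driver; peripheral state is lost at power-off."""

    def __init__(self):
        self.initialized = False

    def init(self):
        self.initialized = True

    def read(self):
        if not self.initialized:
            raise RuntimeError("sensor read before initialization")
        return 21.5  # placeholder reading


def restore_peripherals(nvm, sensor):
    """Bring peripherals back to the state recorded in the checkpoint."""
    if nvm.get("sensor_enabled"):
        sensor.init()  # always re-initialize: the sensor did not keep its state
    # A sample that was in flight at power loss cannot be trusted; drop it.
    nvm["pending_sample"] = None
    return sensor


nvm = {"sensor_enabled": True, "pending_sample": 19.0}
sensor = restore_peripherals(nvm, TemperatureSensor())
reading = sensor.read()
```

The key point is that the checkpoint records which peripherals were in use, while their actual configuration is rebuilt on every boot rather than restored from memory.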

The Concerns With Capacitors

Sizing capacitors in intermittent devices must balance size with speed. More specifically: smaller capacitors charge more quickly but hold less energy, while larger capacitors store more energy but are subject to longer charge times and higher leakage.

The issue arises when the requirements of the device and its peripherals each prioritize differing capacitances. Take a simple device with a low power environmental sensor and a transmitter. The sensor favors smaller capacitance, in order to charge more frequently and retrieve more samples (while being less likely to miss interesting events). Transmitting packets, however, is energy intensive (with even low-power methods requiring energy equal to tens of thousands of operations on a low power microcontroller). The transmitter may not be able to even successfully broadcast below a certain capacitance, but the larger capacitance is in direct conflict with the preference of the sensor to have shorter (but more frequent) bursts of energy for detection purposes.
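The conflict can be put in rough numbers with an idealized model: usable energy is ½·C·(V_on² − V_off²), and charge time is that energy divided by a constant harvest power. The function names and all values below are illustrative assumptions, and leakage and conversion losses are ignored.

```python
def usable_energy_j(capacitance_f, v_on, v_off):
    """Energy released as the capacitor discharges from v_on down to v_off."""
    return 0.5 * capacitance_f * (v_on ** 2 - v_off ** 2)


def charge_time_s(capacitance_f, v_on, v_off, harvest_w):
    """Time to recharge from v_off to v_on at a constant harvest power."""
    return usable_energy_j(capacitance_f, v_on, v_off) / harvest_w


# Illustrative comparison at 100 uW harvested, 3.0 V on, 1.8 V off.
small = charge_time_s(10e-6, 3.0, 1.8, 100e-6)    # 10 uF capacitor
large = charge_time_s(1000e-6, 3.0, 1.8, 100e-6)  # 1000 uF capacitor
```

Under these assumptions the small capacitor is ready again in about 0.3 seconds but delivers under 30 µJ per burst, while the large one delivers a hundred times more energy at the cost of waiting nearly half a minute between activations.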

To this end a variety of capacitor configurations have been explored. These configurations are the subject of their own article, but can range from dynamically adjustable capacitor banks to individual capacitors for each microcontroller and/or peripheral. Knowing the exact capacitor configuration can impact checkpointing: a strategy that assumes a fixed capacitance will obviously struggle with a dynamic bank, and multiple capacitors can complicate energy availability predictions depending on arrangement.

Designing Adaptable Strategies

TBD
