# Lazy Release Persistency

Mahesh Dananjaya, Vasilis Gavrielatos, \*Arpit Joshi, Vijay Nagarajan

University of Edinburgh, Intel\*





# Outline

- Release Persistency (RP)
  - Strengthens existing language-level persistency model
  - Strong enough to enable recovery of log-free data structures (LFDs)
- Lazy Release Persistency (LRP)
  - Microarchitectural mechanism to efficiently enforce RP



#### Non-volatile memory (NVM)

- Fast, byte-addressable persistent storage
- Promises cheap program recovery on a crash
- Program recovery requires crash consistency









**Create Node** 1.





- 1. Create Node
- 2. Update Pointer







1.

2.

3.





1.

2.

3.















#### Goal: Sweet-spot between programmability and performance













































ARP one-sided barriers [Kolli et al.]



#### Acquire-Release Persistency (ARP)





#### Acquire-Release Persistency (ARP)





## Our View: Atomicity vs Ordering

Atomicity is programmable but ordering important too (for *log-free* data structures, LFDs)



## Our View: Atomicity vs Ordering

Atomicity is programmable but ordering important too (for *log-free* data structures, LFDs)

LFDs are important use-case for persistent memory [Izraelevitz '17, David '18, Belloche '18]



## Our View: Atomicity vs Ordering

Atomicity is programmable but ordering important too (for *log-free* data structures, LFDs)

LFDs are important use-case for persistent memory [Izraelevitz '17, David '18, Belloche '18]

But ARP is not strong enough for enabling recovery of LFDs!













\*CAS : Compare-and-swap is a read-modify-write (RMW) instruction.







$$W_1 \rightarrow \text{Rel}$$



step 3:













$$W_1 \rightarrow W_4$$



#### Persistent Orderings for Recovery





#### Persistent Orderings for Recovery





#### Acquire-Release Persistency (ARP) is not strong enough!





#### Acquire-Release Persistency (ARP) is not strong enough!




### Acquire-Release Persistency (ARP) is not strong enough!



Full barriers annul the performance benefits of ARP!















Persist ordering mirrors consistency happens-before ordering

#### (sufficient for LFD recoverability [Izraelevitz and Scott '16])





### How to Enforce RP Efficiently?

- Requirements
  - $\circ \quad \textbf{W}_1 \to \textbf{Rel} \to \textbf{W}_4$
  - But the implementation needs to be buffered

(Suboptimal for  $W_1$  to persist when the Rel performs)



### How to Enforce RP Efficiently?

- Requirements
  - $\circ \quad \textbf{W}_1 \to \textbf{Rel} \to \textbf{W}_4$
  - But the implementation needs to be buffered

(Suboptimal for  $W_1$  to persist when the Rel performs)



These requirements match a 30-year old protocol for enforcing RC lazily!



### Lazy Release Consistency (LRC)

- On a release
  - Write to the local cache (no need to propagate writes before the release)
- On an acquire
  - Find matching release and write-back dirty blocks in the releasing processor



### Lazy Release Consistency (LRC)

- On a release
  - Write to the local cache (no need to propagate writes before the release)
- On an acquire
  - Find matching release and write-back dirty blocks in the releasing processor

Propagate data lazily to enforce RC



### Lazy Release Persistency (LRP)

- On a release
  - Write to the local cache (no need to persist writes before the release)
- On an acquire
  - Find matching release and persist dirty blocks in the releasing processor



### Lazy Release Persistency (LRP)

- On a release
  - Write to the local cache (no need to persist writes before the release)
- On an acquire
  - Find matching release and persist dirty blocks in the releasing processor

#### Persist data lazily to enforce RP



### Evaluation

Benchmarks

• SynchroBench LFD suite from [Gramoli '15]

Comparison points.

- Strict Barrier (SB)
- Buffered two-sided Barrier (BB) [Joshi '15]

Simulation infrastructure

- PRiME simulator
- Processor: 64-core, 2.5 GHz, x86-64 (TSO)
- Caches: L1 (32K), LLC-banked (1MB per bank)
- Coherence: Directory-based MESI
- Interconnection: 2D-Mesh network
- Memory : PCM-NVRAM



LRP vs SB

- Upto 71% Improvement ۲
- Average of 52% ۲







#### LRP vs SB

- Upto 71% Improvement
- Average of 52%

#### LRP vs BB [Joshi '15]

- Upto 44% Improvement
- Average of 33%





#### LRP vs SB

- Upto 71% Improvement
- Average of 52%

#### LRP vs BB [Joshi '15]

- Upto 44% Improvement
- Average of 33%



• Only upto 8% overhead





# Summary

- Crash consistency requires persistency primitives: Ordering or Atomicity?
- Languages should support atomicity but also ordering!
- Why? Log-free data structures can recover without needing atomicity
- State-of-the-art language primitive for ordering ARP is not strong enough for LFDs

#### Contribution

- Release Persistency a language-level persistency model
- Lazy release persistency (LRP) its microarchitectural implementation
- LRP shows 14%-44% improvement over 5 commonly used LFDs.





## **Backup Slides**

#### LRP vs SB

- Upto 71% Improvement
- Average of 52%

#### LRP vs BB [Joshi '15]

- Upto 44% Improvement
- Average of 33%



• Only upto 8% overhead





#### LRP vs SB

- Upto 71% Improvement
- Average of 52%

#### LRP vs BB [Joshi '15]

- Upto 44% Improvement
- Average of 33%



• Only upto 8% overhead





## Writebacks

#### BB LRP 100% % of write-backs in the critical path of execution 75% 50% 25% 0% linkedlist hashmap bstree skiplist queue

## LRP vs BB

• ~41% improvement in the critical path write-backs.

## **Uncached Mode**

#### LRP vs BB

- 24%-68% Improvement
- Average 52%





## LFD Recovery Guarantees



Stronger Guarantees

# LFD Recovery Guarantees



LRP's null recovery is easy to be extended into stronger guarantees without real time ordering

### **One-Sided Release Barrier**



- No coalescing and reordering across barriers
- Flushing data **eagerly** (all previous epochs)

- Coalescing and reordering across barriers
- Flushing data lazily

# **Buffered Epoch Persistency**

<u>Buffered Epoch Persistency (BEP):</u> <u>Rule 1: writes from different epochs can be buffered.</u> Rule 2: writes persist in their epoch order.

Persist Barriers have strong semantics:

- No reordering across barriers
- Persistency overhead is significant!

#### Implementation artifacts:

- Cache-line eviction/writeback conflicts
- Writing to the same cacheline conflicts







# **LRP-Examples**



\*LRP one-sided barriers aggressively reorder writes, persisting stores lazily <sup>64</sup>

# Hardware Extensions



| min-epoch | R | Tag | Data |  |
|-----------|---|-----|------|--|
|-----------|---|-----|------|--|

b) Cache-line metadata

| min-epoch | Address |
|-----------|---------|
| XX        | XXX     |
| XX        | XXX     |
|           |         |

#### c) Release Epoch Table (RET)



a) Overall Architetcure

## **Release Example**



# **One-Sided Acquire Barrier**



- No coalescing and reordering across barriers
- Flushing data eagerly (all previous epochs)

- Coalescing and reordering across barriers
- Flushing data lazily

# **One-Sided Acquire Barrier**

- Acquire can be enforced very easily than the Release barriers.
- If a read is annotated with an Acquire, no additional action is required (can easily be enforced with the min epoch Id mentioned in the Release).
- If a RMW is annotated with Acquire, *write* inside RMW is persisted first.
- Regular Write can never be annotated with acquire.

# All Cases Handled

- $Wr \rightarrow Wr$  : Write after write is no issue.
- **Rel**  $\rightarrow$  **Wr** : Write after release can be coalesced. unlimitedly.
- Wr → Rel : Release after write can be coalesces only if min epoch id is equal to rel epoch id -1.
- Rel → Rel : Release after release to a cacheline can be coalesced only if they are consecutive releases (epoch id+1). RET table is used, but can be optimized with/without the RET table as well.
- Different consistency guarantees: **RC-SC** or **RC-PC**.

### LRP One-Sided Release Barrier

2

3



**Key Points** 

- Release barrier is flagged using a bit
- Cache-lines store <u>only</u> min epoch Id
- "Pending-persist" counter to track write-backs
- 4 Persist engine to enforce the RC semantics



\*min epoch Id: First epoch that writes the cache-line.

### **Crash Consistency**

• Content of memory after a crash needs to be consistent



### Crash Consistency

• Content of memory after a crash needs to be consistent

