Low level page/kv recovery from corrupted bbolt db (etcd)

Hello, I have ended up with a corrupt bbolt db on my single-node etcd used for k8s, and I'm now looking for assistance and tools that could help me recover from the current situation. What happened, in short:
- Server lost power
- Restarted
- Everything booted up
- Kubelet with manifest pods started

At first my auth was rejected; I grabbed super-admin.conf from the server to use as auth, and that worked.
The full pod list only had api, etcd, scheduler and controller.

BBolt:
```
┌─────────┬──────────┬────────────┬────────────┬─────────┐
│  HASH   │ REVISION │ TOTAL KEYS │ TOTAL SIZE │ VERSION │
├─────────┼──────────┼────────────┼────────────┼─────────┤
│ 5fa1254 │ 25764424 │        321 │     112 MB │         │
└─────────┴──────────┴────────────┴────────────┴─────────┘
```

Since the file was still 112 MB, I had hopes and kept digging. Once I found out that etcd uses bbolt, this repo provided some tools to get a better understanding of the situation.

The db contains 27446 pages.
All pages after 407 are free; page 407 is the freelist, containing 27212 items with a 53-page overflow.

Manually scanning through each page, I can recover 4033 pages with 37521 keys.
These commands gave me that information:
```
mkdir -p pages
# Page IDs run from 0 to 27445 (27446 pages in total)
for a in {0..27445}; do ./bbolt page db --format-value ascii-encoded "$a" > "pages/$a.txt"; done

# Count the keys found
grep -rin -P '\w{34}:' pages | wc -l

# Count the pages containing keys
grep -rin -P '\w{34}:' pages | awk -F':' '{ print $1 }' | sort | uniq | wc -l

# Count the corrupt pages
grep -rin 'invalid value due to unexpected' pages | wc -l
```
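A minimal sketch of the same survey done directly on the raw file, without going through the CLI. It assumes a 4096-byte page size, little-endian byte order, and the 16-byte bbolt page header (`id` uint64, `flags` uint16, `count` uint16, `overflow` uint32); `classify_pages` is a hypothetical helper, not part of bbolt. Untouched or zeroed free pages will also be reported as "suspect", so the result is only a first pass.

```python
import struct

PAGE_SIZE = 4096  # assumption: default page size on most Linux systems
HEADER = struct.Struct("<QHHI")  # id, flags, count, overflow (16 bytes)
FLAG_NAMES = {0x01: "branch", 0x02: "leaf", 0x04: "meta", 0x10: "freelist"}


def classify_pages(data, page_size=PAGE_SIZE):
    """Yield (page_index, status, kind) for every page-sized slot in data."""
    total = len(data) // page_size
    skip = 0  # overflow pages still belonging to the last valid header
    for index in range(total):
        if skip:
            skip -= 1
            yield index, "overflow", None
            continue
        pid, flags, _count, overflow = HEADER.unpack_from(data, index * page_size)
        kind = FLAG_NAMES.get(flags)
        # A header is trusted only if the stored id matches its position,
        # the flags are a known page type, and the overflow stays in the file.
        if pid == index and kind is not None and index + overflow < total:
            skip = overflow
            yield index, "ok", kind
        else:
            yield index, "suspect", None
```

With the real file this could be driven as `classify_pages(open("db", "rb").read())`, which is fine for a 112 MB database.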

For corrupt pages (with the error ` error: invalid value due to unexpected Page id: 6073476030241133382 != 563` or similar), I was able to look up where in the file the corruption occurred. I convert `6073476030241133382` to the bytes it occupies on disk (`46 53 5F 56 45 52 49 54`; the page id is stored little-endian, so the numeric hex value is `0x54495245565F5346`) and simply search for those bytes:
```
00232FC0   00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00232FD0   00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00232FE0   00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00232FF0   00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00233000   46 53 5F 56  45 52 49 54  59 22 3A 7B  7D 2C 22 66  FS_VERITY":{},"f
00233010   3A 46 53 5F  56 45 52 49  54 59 5F 42  55 49 4C 54  :FS_VERITY_BUILT
00233020   49 4E 5F 53  49 47 4E 41  54 55 52 45  53 22 3A 7B  IN_SIGNATURES":{
00233030   7D 2C 22 66  3A 46 54 4C  22 3A 7B 7D  2C 22 66 3A  },"f:FTL":{},"f:
00233040   46 54 52 41  43 45 22 3A  7B 7D 2C 22  66 3A 46 54  FTRACE":{},"f:FT
```
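Often the page appears to start inside old data, which does not have a valid page header. Here, offset `0x233000` is 563 × 4096, which is exactly where page 563 should begin with a 4096-byte page size.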
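The lookup described above can be sketched as follows. This assumes a 4096-byte page size and a little-endian page id at the start of each page; the helper names are hypothetical and only illustrate the arithmetic.

```python
PAGE_SIZE = 4096  # assumption: default page size on most Linux systems


def id_as_disk_bytes(page_id):
    """Return the 8 bytes a uint64 page id occupies on disk (little-endian)."""
    return page_id.to_bytes(8, "little")


def page_offset(page_number, page_size=PAGE_SIZE):
    """Return the file offset at which a given page should start."""
    return page_number * page_size


def find_all(data, needle):
    """Return every offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets
```

For the error above, `id_as_disk_bytes(6073476030241133382)` gives `b"FS_VERIT"`, and `page_offset(563)` gives `0x233000`, which matches the hex dump.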

I attempted to fully wipe the freelist (from the meta pages, via the existing command `./bbolt surgery freelist abandon db --output db-abandoned`) and to wipe the entire block containing the freelist page header and all its items, in the hope that all the pages that had been freed would be used again. That did not work too well, as various bbolt commands started crashing on the first corrupt page (563).

Other attempts:
- ext4magic - attempted to recover older versions of the bbolt db; however, this did not yield any results, as some of the blocks appear to have been reused (overwritten) and the tool does not export partially corrupt files.
- scandisk - attempted to recover a deleted version of the db, but there is nothing to recover; the database appears to have been overwritten in place

My thoughts on how to proceed:
1. The best approach seems to be not to reconstruct the corrupted database, but to salvage as many kvs as possible and import them into an empty etcd
2. Spend more time on the .wal files; those seem like they could be useful in restoring data
3. Scan through the disk image, which I saved after realizing that etcd was empty, for any additional salvageable pages containing keys
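For the salvage route, a rough sketch of pulling raw key/value pairs out of a single leaf page, whether it comes from the db file or the disk image. It assumes the bbolt leaf layout: a 16-byte page header followed by 16-byte leaf elements (`flags`, `pos`, `ksize`, `vsize`, all uint32), with `pos` measured from the start of the element itself. `leaf_items` is a hypothetical helper; it does not decode the etcd values (which I believe are protobuf-encoded) and expects the caller to pass the page together with its overflow pages if values are large.

```python
import struct

PAGE_HEADER = struct.Struct("<QHHI")  # id, flags, count, overflow
LEAF_ELEM = struct.Struct("<IIII")    # flags, pos, ksize, vsize
LEAF_FLAG = 0x02


def leaf_items(page):
    """Return (key, value) byte pairs from one leaf page, [] if not a leaf."""
    if len(page) < PAGE_HEADER.size:
        return []
    _pid, flags, count, _overflow = PAGE_HEADER.unpack_from(page, 0)
    if flags != LEAF_FLAG:
        return []
    items = []
    for i in range(count):
        elem_off = PAGE_HEADER.size + i * LEAF_ELEM.size
        if elem_off + LEAF_ELEM.size > len(page):
            break  # element table runs past the buffer: stop, keep what we have
        _eflags, pos, ksize, vsize = LEAF_ELEM.unpack_from(page, elem_off)
        key_off = elem_off + pos  # pos is relative to the element, not the page
        end = key_off + ksize + vsize
        if end > len(page):
            break  # data points outside the buffer: treat the rest as corrupt
        items.append((page[key_off:key_off + ksize], page[key_off + ksize:end]))
    return items
```

Stopping at the first implausible element keeps whatever parsed cleanly before it, which seems like the right trade-off when the goal is to salvage rather than to validate.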


If anyone has any useful tips, I would greatly appreciate them. etcd contained all the necessary configuration for a rook-ceph cluster, and that thing has my life in it...

Thanks
