Description
Hello, I have ended up with a corrupt bbolt db on my single-node etcd used for k8s, and I'm looking for assistance and tools that could help me recover from the current situation. What happened, in short:
- Server lost power
- Restarted
- Everything booted up
- Kubelet with manifest pods started
At first my auth was rejected; I grabbed super-admin.conf from the server to use as auth, and that worked.
The full pod list only had api, etcd, scheduler and controller.
BBolt:
┌─────────┬──────────┬────────────┬────────────┬─────────┐
│ HASH │ REVISION │ TOTAL KEYS │ TOTAL SIZE │ VERSION │
├─────────┼──────────┼────────────┼────────────┼─────────┤
│ 5fa1254 │ 25764424 │ 321 │ 112 MB │ │
└─────────┴──────────┴────────────┴────────────┴─────────┘
Since the file was still 112 MB, I had hope and kept digging. Once I found out that etcd uses bbolt, this repo provided some tools to get a better understanding of the situation.
The db contains 27446 pages.
All pages after 407 are free; page 407 is the freelist, containing 27212 items with a 53-page overflow.
Manually scanning through each page, I can recover 4033 pages with 37521 keys.
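As a sanity check, the numbers above are consistent with each other. This is a minimal sketch, assuming a 4096-byte page size, a 16-byte page header and 8-byte freelist entries (my understanding of bbolt's on-disk layout, not something confirmed in this report); the function names are hypothetical.

```python
PAGE_SIZE = 4096        # assumption: page size of this db
PAGE_HEADER_SIZE = 16   # assumption: id (8) + flags (2) + count (2) + overflow (4)
PGID_SIZE = 8           # assumption: each freelist entry is a uint64 page id


def db_size_bytes(page_count: int, page_size: int = PAGE_SIZE) -> int:
    """Expected file size for a db with page_count pages."""
    return page_count * page_size


def freelist_overflow(item_count: int, page_size: int = PAGE_SIZE) -> int:
    """Extra pages needed by a freelist holding item_count page ids.

    Only valid for item_count < 0xFFFF; larger freelists store the
    count in an extra leading slot, which is not modelled here.
    """
    total = PAGE_HEADER_SIZE + item_count * PGID_SIZE
    pages = -(-total // page_size)  # ceiling division
    return pages - 1


print(db_size_bytes(27446))      # 112418816, i.e. the reported 112 MB
print(freelist_overflow(27212))  # 53, matching the reported overflow
```

So the file size and the freelist header both look intact; the damage is in the pages the freelist points at.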
I gathered the page and key counts above with these commands:
for a in {0..27446}; do ./bbolt page db --format-value ascii-encoded "$a" > "pages/$a.txt"; done
# Count the keys found
grep -rin -P '\w{34}:' pages | wc -l
# Count the pages containing keys
grep -rin -P '\w{34}:' pages | awk -F':' '{ print $1 }' | sort | uniq | wc -l
# Count the corrupt pages
grep -rin 'invalid value due to unexpected' pages | wc -l
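The same classification can be done directly on the file, without dumping every page to text first. The following is a minimal sketch, not a bbolt tool: it assumes a 4096-byte page size and the page header layout as I understand it from the bbolt source (little-endian id, flags, count, overflow), and `scan_pages` and `PageInfo` are hypothetical names.

```python
import struct
from typing import Iterator, NamedTuple

PAGE_SIZE = 4096                 # assumption: page size of this db
HEADER = struct.Struct("<QHHI")  # assumed header: id, flags, count, overflow
KNOWN_FLAGS = {0x01: "branch", 0x02: "leaf", 0x04: "meta", 0x10: "freelist"}


class PageInfo(NamedTuple):
    index: int      # position of the page in the file
    stored_id: int  # page id read from the header
    kind: str       # page type derived from the flags, or "unknown"
    count: int
    overflow: int
    ok: bool        # header id matches the position and flags are known


def scan_pages(data: bytes, page_size: int = PAGE_SIZE) -> Iterator[PageInfo]:
    """Read the header at every page boundary and flag implausible ones.

    Continuation pages of an overflowing page carry no header of their
    own, so they are reported as not ok even in a healthy db.
    """
    for index in range(len(data) // page_size):
        stored_id, flags, count, overflow = HEADER.unpack_from(
            data, index * page_size
        )
        kind = KNOWN_FLAGS.get(flags, "unknown")
        ok = stored_id == index and kind != "unknown"
        yield PageInfo(index, stored_id, kind, count, overflow, ok)


# Usage (hypothetical path):
# with open("db", "rb") as f:
#     suspect = [p for p in scan_pages(f.read()) if not p.ok]
```

A mismatch between `stored_id` and `index` corresponds to the "unexpected Page id" error that bbolt prints.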
For corrupt pages (with an error such as "error: invalid value due to unexpected Page id: 6073476030241133382 != 563" or similar), I was able to look up where in the file the corruption occurred. I convert 6073476030241133382 to the bytes it occupies on disk (46 53 5F 56 45 52 49 54, which is the little-endian encoding; as a number the value is 0x54495245565F5346) and simply search for those bytes:
00232FC0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00232FD0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00232FE0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00232FF0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00233000 46 53 5F 56 45 52 49 54 59 22 3A 7B 7D 2C 22 66 FS_VERITY":{},"f
00233010 3A 46 53 5F 56 45 52 49 54 59 5F 42 55 49 4C 54 :FS_VERITY_BUILT
00233020 49 4E 5F 53 49 47 4E 41 54 55 52 45 53 22 3A 7B IN_SIGNATURES":{
00233030 7D 2C 22 66 3A 46 54 4C 22 3A 7B 7D 2C 22 66 3A },"f:FTL":{},"f:
00233040 46 54 52 41 43 45 22 3A 7B 7D 2C 22 66 3A 46 54 FTRACE":{},"f:FT
Often a page appears to start in the middle of old data, so it does not have a valid page header.
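The lookup can be reproduced in a few lines. This is a small sketch, assuming a 4096-byte page size and a little-endian page id field; the helper names are hypothetical.

```python
import struct

PAGE_SIZE = 4096  # assumption: page size of this db


def pgid_to_disk_bytes(pgid: int) -> bytes:
    """Bytes that a uint64 page id occupies on disk (little-endian)."""
    return struct.pack("<Q", pgid)


def page_offset(page_index: int, page_size: int = PAGE_SIZE) -> int:
    """File offset at which the given page starts."""
    return page_index * page_size


def find_all(data: bytes, needle: bytes) -> list:
    """All offsets at which needle occurs in data."""
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits


needle = pgid_to_disk_bytes(6073476030241133382)
print(needle.hex(" ").upper())  # 46 53 5F 56 45 52 49 54
print(needle)                   # b'FS_VERIT'
print(hex(page_offset(563)))    # 0x233000
```

The bogus "page id" is simply the first eight bytes of the JSON text "FS_VERITY", and 0x233000 is exactly where page 563 starts, which matches the hexdump above.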
I attempted to fully wipe the freelist (from the meta pages, via the existing command ./bbolt surgery freelist abandon db --output db-abandoned) and also to wipe the entire block containing the freelist page header and all of its items, in the hope that all of the freed pages would be used again. That did not work well: various bbolt commands started crashing on the first corrupt page (563).
Other attempts:
- ext4magic: an attempt to recover older versions of the bbolt db. However, this did not yield any results, as some of the blocks appear to have been reused (overwritten) and the tool does not export partially corrupt files.
- scandisk: an attempt to recover a deleted version of the db, but there is nothing to recover; the database appears to have been overwritten in place.
My thoughts on how I am planning to proceed:
- It seems the best approach is not to reconstruct the corrupted database, but to salvage as many kvs as possible and import them into an empty etcd
- Spend more time on the .wal files; they seem like they could be useful in restoring data
- Scan through the disk image, which I saved after realizing that etcd was empty, for any additional salvageable pages containing keys
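For the salvage step, a leaf page can be decoded without going through bbolt at all. The following is a minimal sketch under stated assumptions: it relies on the leaf element layout as I understand it from the bbolt source (16-byte page header, then 16-byte elements of flags, pos, ksize, vsize, with pos counted from the start of the element), and `leaf_items` is a hypothetical helper. Decoding the etcd values themselves is left out.

```python
import struct
from typing import Iterator, Tuple

PAGE_HEADER = struct.Struct("<QHHI")  # assumed: id, flags, count, overflow
LEAF_ELEM = struct.Struct("<IIII")    # assumed: flags, pos, ksize, vsize
LEAF_FLAG = 0x02                      # assumed page flag for leaf pages


def leaf_items(page: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (key, value) pairs from one leaf page.

    `page` should hold the page plus any overflow pages that follow it.
    Elements that point outside the buffer are skipped, not trusted.
    """
    if len(page) < PAGE_HEADER.size:
        return
    _id, flags, count, _overflow = PAGE_HEADER.unpack_from(page, 0)
    if flags != LEAF_FLAG:
        return
    for i in range(count):
        elem_off = PAGE_HEADER.size + i * LEAF_ELEM.size
        if elem_off + LEAF_ELEM.size > len(page):
            return  # element table runs past the buffer
        _eflags, pos, ksize, vsize = LEAF_ELEM.unpack_from(page, elem_off)
        start = elem_off + pos          # key starts pos bytes after the element
        end = start + ksize + vsize     # value follows the key directly
        if end > len(page):
            continue  # likely corrupt element
        yield page[start:start + ksize], page[start + ksize:end]
```

Running this over every page whose header looks like a leaf, including pages on the freelist, would collect candidate key/value pairs. Because free pages hold stale data, the same key can turn up in several versions, so duplicates need to be resolved before importing anything into an empty etcd.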
If anyone has any useful tips, I would greatly appreciate them. etcd contained all the necessary configuration for a rook-ceph cluster, and that thing has my life in it...
Thanks