[tidb-x] TiFlash WN crashes repeatedly after recovering from a DFS network partition #10624

@Lily2025

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Run ch.
2. Simulate a DFS network partition lasting 10 minutes.
   Chaos time: 2025/12/21 11:35:58.043 +08:00 ~ 2025/12/21 11:45:58.044 +08:00
3. Recover from the fault.

2. What did you expect to see? (Required)

After the fault is recovered, all TiFlash nodes return to normal.

3. What did you see instead (Required)

After the fault is recovered, the TiFlash WN crashes repeatedly and cannot recover. Log from one of the crashes:
{"namespace":"ha-test-tiflash-tps-8040398-1-848","log":"[2025/12/21 11:47:19.611 +08:00] [ERROR] [BaseDaemon.cpp:560] [\"\\n 0x556d665d42fe\\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+133485310]\\n \\tlibs/libdaemon/src/BaseDaemon.cpp:211\\n 0x7fed4af75bf0\\t<unknown symbol> [libc.so.6+257008]\\n 0x556d671ade7a\\tvoid DB::S3::TiFlashS3Client::setBucketAndKeyWithRoot<Aws::S3::Model::PutObjectRequest>(Aws::S3::Model::PutObjectRequest&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) const [tiflash+145911418]\\n \\tdbms/src/Storages/S3/S3Common.h:87\\n 0x556d671a460e\\tDB::S3::uploadFile(DB::S3::TiFlashS3Client const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, DB::EncryptionPath const&, std::__1::shared_ptr<DB::FileProvider> const&, int) [tiflash+145872398]\\n \\tdbms/src/Storages/S3/S3Common.cpp:638\\n 0x556d66fcb54d\\tstd::__1::__packaged_task_func<DB::DM::Remote::DataStoreS3::putDMFileLocalFiles(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&, DB::DM::Remote::DMFileOID const&)::$_2, std::__1::allocator<DB::DM::Remote::DataStoreS3::putDMFileLocalFiles(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&, DB::DM::Remote::DMFileOID const&)::$_2>, void ()>::operator()() (.0657bb016a1310a986a4a2750184c5f4) [tiflash+143934797]\\n 
\\tdbms/src/Storages/DeltaMerge/Remote/DataStore/DataStoreS3.cpp:85\\n 0x556d60b3b895\\tstd::__1::packaged_task<void ()>::operator()() [tiflash+38488213]\\n \\t/usr/local/bin/../include/c++/v1/future:1891\\n 0x556d60b39a79\\tDB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<DB::ThreadFromGlobalPoolImpl<false>, void*>) [tiflash+38480505]\\n \\t/usr/local/bin/../include/c++/v1/__functional/function.h:517\\n 0x556d60b3c3a3\\tstd::__1::__function::__func<DB::ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), std::__1::allocator<DB::ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'()>, void ()>::operator()() [tiflash+38491043]\\n \\tdbms/src/Common/UniThreadPool.cpp:169\\n 0x556d60b3af68\\tvoid* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void DB::ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*) [tiflash+38485864]\\n \\t/usr/local/bin/../include/c++/v1/__functional/function.h:517\\n 0x7fed4afc119a\\tstart_thread [libc.so.6+565658]\"] [source=BaseDaemon] [thread_id=1101]\n","pod":"tc-tiflash-0","container":"serverlog","time":"2025-12-21T03:47:19.972866215Z","stream":"stdout"}
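
For triage, the escaped stack trace in the log entry above can be unpacked into one frame per line. The following is only a minimal sketch, assuming each container log entry is a single JSON object whose `log` field holds the TiFlash message with `\n` and `\t` kept as two-character escapes, as in the entry above. The names `extract_frames` and `SAMPLE_LINE` are hypothetical helpers for illustration, not part of TiFlash, and the sample is a shortened copy of the first two frames.

```python
import json
import re

# Shortened sample built from the first two frames of the log entry above.
# "\\n" and "\\t" are literal backslash escapes inside the TiFlash message.
SAMPLE_LINE = json.dumps({
    "pod": "tc-tiflash-0",
    "log": '[2025/12/21 11:47:19.611 +08:00] [ERROR] [BaseDaemon.cpp:560] '
           '["\\n 0x556d665d42fe\\tfaultSignalHandler(int, siginfo_t*, void*) '
           '[tiflash+133485310]\\n \\tlibs/libdaemon/src/BaseDaemon.cpp:211'
           '\\n 0x7fed4af75bf0\\t<unknown symbol> [libc.so.6+257008]"] '
           '[source=BaseDaemon] [thread_id=1101]\n',
})


def extract_frames(raw_line: str) -> list[dict]:
    """Return the stack frames found in one JSON container log line."""
    message = json.loads(raw_line)["log"]
    # The trace is the quoted payload between '["' and '"]' (assumed layout).
    payload = re.search(r'\["(.*)"\]', message, re.S)
    if payload is None:
        return []
    # Turn the two-character escapes into real newlines and tabs.
    text = payload.group(1).replace("\\n", "\n").replace("\\t", "\t")
    frames = []
    for line in text.splitlines():
        line = line.strip()
        frame = re.match(r"(0x[0-9a-fA-F]+)\s+(.*)", line)
        if frame:
            # A new frame: address followed by the symbol and module offset.
            frames.append({"addr": frame.group(1),
                           "symbol": frame.group(2),
                           "source": None})
        elif frames and frames[-1]["source"] is None and re.search(r":\d+$", line):
            # A "file:line" entry belongs to the frame printed just before it.
            frames[-1]["source"] = line
    return frames


for f in extract_frames(SAMPLE_LINE):
    print(f["addr"], f["symbol"], f["source"] or "(no source info)")
```

Run against the full entry, the frames directly below `faultSignalHandler` and the libc signal frame are the ones to look at first; in the trace above those are `setBucketAndKeyWithRoot` (S3Common.h:87) called from `DB::S3::uploadFile` (S3Common.cpp:638).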

4. What is your TiFlash version? (Required)

/tiflash/tiflash version
TiFlash
Release Version: v9.0.0-beta.2.pre-99-g0ddabdef5
Edition: Enterprise
Git Commit Hash: 0ddabde
Git Branch: HEAD
UTC Build Time: 2025-12-12 07:57:30
Enable Features: jemalloc sm4(GmSSL) mem-profiling avx2 avx512 unwind thinlto next-gen hnsw.l2=skylake hnsw.cosine=skylake vec.l2=skylake vec.cos=skylake
Profile: RELWITHDEBINFO
Compiler: clang++ 17.0.6

Raft Proxy
Git Commit Hash: 2505f2f8d3061d8e61aa6f4ff4b91ade95a50785
Git Commit Branch: HEAD
UTC Build Time: ""
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Storage Engine: tiflash
Prometheus Prefix: tiflash_proxy_
Profile: release

Labels: component/storage, nextgen (indicates that the issue or PR belongs to the nextgen kernel architecture), severity/major, type/bug (the issue is confirmed as a bug)
