Smart Contract Security Tools: A Comprehensive Review
Introduction
At Kleros, security is paramount. As we build decentralized dispute resolution systems that handle significant value, we need to ensure our smart contracts are resistant to attacks and vulnerabilities. While human review remains essential, automated tools can help identify issues early in the development process and provide an additional layer of security.
We recently conducted a comprehensive evaluation of popular smart contract security tools to assess their effectiveness and determine whether any should be incorporated into our development workflow. This article shares our findings to help other teams make informed decisions about their security tooling.
Why We Conducted This Review
As Kleros continues to develop and deploy complex smart contracts, we're constantly looking to improve our security practices. Automated tools promise to catch common vulnerabilities with minimal effort, potentially saving time and reducing risk. However, with numerous options available, it's difficult to know which tools actually deliver on these promises.
Our objectives were to:
- Evaluate the effectiveness of various security tools against known vulnerabilities
- Assess false positive rates and report quality
- Determine which tools, if any, should be integrated into our development workflow
- Share our findings with the broader Ethereum ecosystem
Methodology
We tested a range of security tools across three categories:
- Static Analyzers: Tools that examine code without executing it
  - Slither: Trail of Bits' Python-based static analyzer
  - Aderyn: Cyfrin's Rust-based static analyzer
  - Mythril: ConsenSys' symbolic execution tool
  - SolidityScan: Commercial automated scanner
  - ChatGPT (as a comparative baseline)
- Fuzzers: Tools that generate random inputs to find vulnerabilities
  - Diligence Fuzzing: ConsenSys' fuzzing service
  - Echidna: Trail of Bits' property-based fuzzer
- Symbolic Testing: Tools that mathematically verify code properties
  - Halmos: a16z's symbolic testing framework
Benchmark Contracts
We used a variety of smart contracts with different complexity levels and known vulnerabilities:
Very Easy: Simple contracts from HackingWorkshop with deliberately flawed design:
- Store: Minimal storage contract with an array-spamming vulnerability.
- DiscountedBuy: Simulates a purchase mechanism with a deliberate arithmetic bug.
- HeadOrTail: A gambling contract exposing issues in randomness and fund management. For fuzzer tests, this was replaced with Coffers.
- Vault: A simple fund storage contract with a potential reentrancy vulnerability.
- Registry: A registry contract whose entry validation relies on a collision-prone hash.
Easy: Real Kleros contract with a straightforward design:
- DisputeResolverV2: The real version of the resolver contract, used to check for false positives and possible findings.
- DisputeResolverBugged: The same dispute resolver contract with a deliberately introduced access control bug, simulating a human error slipping into production.
Middle: Real Kleros contract with a more elaborate design:
- RealitioForeignProxyArbitrum: The real version with no known issues, used to test precision in complex scenarios and to see what feedback the tools would give.
- RealitioForeignProxyArbitrumBugged: The same contract with an intentionally introduced bug.
Hard: KlerosV2 core stack:
- SortitionModule: Covers all the dependencies of the V2 protocol.
- SortitionModuleBugged: The same contract with an intentional bug in arbitrator selection, illustrating the challenges that complex systems pose. Note that this is a real bug that was recently found during Foundry unit testing.
For each complexity level beyond "Very Easy," we included both the original contract and a version with an intentionally introduced bug. This approach allowed us to test both vulnerability detection capabilities and false positive rates.
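To make the "Very Easy" bug classes concrete, the sketches below illustrate two of them: the array-spamming pattern in Store and the hash-collision pattern in Registry. These are simplified, hypothetical reconstructions in the spirit of the HackingWorkshop contracts, not the exact benchmark code.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical sketch in the spirit of the "Store" benchmark: anyone can push
// entries into an unbounded array, so withdrawAll() eventually runs out of gas.
contract StoreLike {
    struct Safe { address owner; uint256 amount; }
    Safe[] public safes;

    function store() external payable {
        safes.push(Safe(msg.sender, msg.value));
    }

    // Iterating over an array that an attacker can spam with dust deposits
    // can exceed the block gas limit, locking everyone's funds.
    function withdrawAll() external {
        for (uint256 i = 0; i < safes.length; i++) {
            Safe storage safe = safes[i];
            if (safe.owner == msg.sender && safe.amount != 0) {
                uint256 amount = safe.amount;
                safe.amount = 0;
                payable(msg.sender).transfer(amount);
            }
        }
    }
}

// Hypothetical sketch in the spirit of the "Registry" benchmark: hashing the
// concatenation of two variable-length strings lets distinct (a, b) pairs
// collide, e.g. ("ab", "c") and ("a", "bc") produce the same id.
contract RegistryLike {
    mapping(bytes32 => bool) public registered;

    function register(string calldata a, string calldata b) external {
        bytes32 id = keccak256(abi.encodePacked(a, b)); // collision-prone
        require(!registered[id], "already registered");
        registered[id] = true;
    }
}
```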
Grading System
We graded each tool on an A-F scale based on:
- Ability to detect known vulnerabilities
- Quality and clarity of reports
- False positive rate
- Ease of use and integration
Results
Below is a summary table of our findings, showing which tools successfully identified vulnerabilities in each test contract:
Results Table
| Contract | Slither | Aderyn | Mythril | SolidityScan | ChatGPT | Diligence Fuzzing | Echidna | Halmos |
|---|---|---|---|---|---|---|---|---|
| Store | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | - |
| DiscountedBuy | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | - |
| HeadOrTail/Coffers | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | - |
| Vault | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | - |
| Registry | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DisputeResolver | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | - |
| RealitioForeignProxy | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | - |
| SortitionModule | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | - |
| Overall Grade | D | D | D+ | F | B | F | D | D- |

(✅ = vulnerability detected, ❌ = missed, "-" = not tested with this tool)
Static Analyzers
Slither (Grade: D)
Slither identified only 2 out of 5 bugs in our very easy contracts. It generated verbose reports that often missed critical issues while flagging numerous non-issues. The reports were lengthy, making it difficult to distinguish genuine vulnerabilities from false positives. Common false positives included re-entrancy warnings and external calls in loops, even when these weren't exploitable.
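As an illustration of that noise, here is a hypothetical snippet (not from our codebase) of the sort that commonly trips these detectors even though no funds are at risk:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical example of code that static analyzers tend to flag:
// an external call inside a loop, and a non-critical state write after an
// external call, are both typical sources of warnings even when the
// checks-effects-interactions pattern protects the funds involved.
contract BatchPayout {
    mapping(address => uint256) public owed;
    mapping(address => uint256) public lastClaimed;

    function claim() external {
        uint256 amount = owed[msg.sender];
        owed[msg.sender] = 0; // effect before interaction
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        lastClaimed[msg.sender] = block.timestamp; // write after a call: often
                                                   // reported as reentrancy
    }

    function payOutMany(address payable[] calldata recipients) external {
        for (uint256 i = 0; i < recipients.length; i++) {
            uint256 amount = owed[recipients[i]];
            if (amount == 0) continue;
            owed[recipients[i]] = 0;
            recipients[i].transfer(amount); // external call in a loop: flagged
        }
    }
}
```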
Aderyn (Grade: D)
Aderyn, a newer Rust-based analyzer, correctly identified only 1 of the 5 very easy vulnerabilities. Its reports were even more verbose than Slither's, with an excessive number of false positives such as "locked ETH" and "unchecked addresses." It did find some legitimate but minor issues in our most complex contracts, including unused custom errors and unspecified integer types.
Mythril (Grade: D+)
Mythril produced the least noisy reports among the dedicated static analyzers but still identified only 1 of the 5 very easy vulnerabilities. Its symbolic execution engine seemed promising, detecting specific attack possibilities like "An attacker may be able to run a transaction after our transaction which can change the value of the call." However, it generated empty reports for several contracts and had lengthy analysis times even for small contracts.
SolidityScan (Grade: F)
SolidityScan (QuickScan version) was extremely limited, providing only general issue categories without specific details. It correctly identified 2 of the 5 very easy vulnerabilities but also reported numerous false positives, particularly related to access control. For our complex KlerosV2 stack, it claimed to find 214 issues, none of which appeared useful from the limited information available in the free version.
ChatGPT (Grade: B)
Surprisingly, ChatGPT outperformed the dedicated static analyzers, correctly identifying all 5 vulnerabilities in our very easy contracts. It also correctly identified the bug in our Easy-level dispute resolver contract. It provided insights like "array can be spammed with junk" and identified the need for a commit-reveal scheme in HeadOrTail. It struggled with more complex systems but provided more concise and actionable reports than the specialized tools, and false positives could usually be resolved through follow-up questioning.
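For reference, the commit-reveal pattern it suggested looks roughly like the sketch below (a hypothetical illustration of the pattern, not the actual HeadOrTail fix):

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Minimal commit-reveal sketch: the host commits to a hash of its secret before
// players bet, so the outcome cannot be chosen after seeing the player's guess.
contract CommitRevealFlip {
    bytes32 public commitment; // keccak256(secret, salt), fixed at deployment
    address public host;

    constructor(bytes32 _commitment) {
        host = msg.sender;
        commitment = _commitment;
    }

    // ... players place their guesses here, after the commitment is fixed ...

    function reveal(bool secret, bytes32 salt) external {
        require(msg.sender == host, "only host");
        require(keccak256(abi.encodePacked(secret, salt)) == commitment, "bad reveal");
        // resolve bets using `secret` as the coin-flip outcome
    }
}
```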
Fuzzers
Diligence Fuzzing (Grade: F)
Diligence Fuzzing required extensive setup, including writing Scribble annotations and deployment scripts. Even with this effort, it detected only 1 of the 5 vulnerabilities and required significant constraints to trigger even known issues. For example, it could only detect the array-spamming issue in Store when the array length was artificially limited to 10 elements. Some reports also disappeared unpredictably during testing.
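For context, Diligence Fuzzing's setup revolves around Scribble annotations: special comments that state properties the fuzzer then tries to violate. The sketch below shows the rough shape of such an annotation on a hypothetical contract; consult the current Scribble documentation for exact syntax.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Rough sketch of a Scribble-annotated function of the kind Diligence Fuzzing
// consumes (hypothetical contract, for illustration only).
contract StoreHarness {
    mapping(address => uint256) public balances;
    uint256 public totalDeposited;

    /// #if_succeeds {:msg "deposit credited"} balances[msg.sender] == old(balances[msg.sender]) + msg.value;
    /// #if_succeeds {:msg "total tracks deposits"} totalDeposited == old(totalDeposited) + msg.value;
    function deposit() external payable {
        balances[msg.sender] += msg.value;
        totalDeposited += msg.value;
    }
}
```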
Echidna (Grade: D)
Echidna was slightly easier to use than Diligence Fuzzing but still required writing custom test properties. It correctly identified 2 out of 5 vulnerabilities but only when provided with significant constraints and guidance. Like Diligence Fuzzing, it could only detect the Store issue with artificially limited array sizes. Its inability to use parameters in test functions limited its usefulness for detecting issues like hash collisions.
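For comparison, an Echidna property is a parameterless boolean function on a Solidity harness; Echidna calls the contract's other functions with random inputs and reports any sequence that makes a property return false. The harness below is a hypothetical sketch in the spirit of our Store setup, including the artificial cap we had to add.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Minimal sketch of an Echidna property test. The `echidna_*` function takes no
// parameters, which is the limitation mentioned above; the cap makes the
// array-spamming issue reachable within a short fuzzing campaign.
contract StoreEchidnaTest {
    uint256 public constant MAX_SAFES = 10; // artificial constraint
    uint256[] internal safes;

    function store(uint256 amount) external {
        safes.push(amount);
    }

    // Echidna reports a failure once random calls to store() push the array
    // past the cap, demonstrating the unbounded-growth problem.
    function echidna_array_is_bounded() external view returns (bool) {
        return safes.length <= MAX_SAFES;
    }
}
```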
Symbolic Testing
Halmos (Grade: D-)
Halmos, a specialized tool designed for specific use cases rather than general vulnerability detection, failed to detect hash collisions in our Registry contract. It had significant limitations in handling various data types: uint values were capped at relatively low numbers (~10^15), and it was unable to generate problematic string inputs. It performed well only with address-type inputs.
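For context, a Halmos check is written as an ordinary Solidity test function whose parameters are treated as symbolic values (by default, Halmos picks up functions prefixed with check_). The hypothetical sketch below shows the kind of property we aimed at for Registry; the string parameters are exactly where the tool struggled.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical Halmos-style symbolic test: the string parameters are symbolic,
// and the final assertion is falsifiable, e.g. ("ab", "c") vs ("a", "bc"),
// which is the hash-collision bug hidden in the Registry benchmark.
contract RegistryCollisionCheck {
    function check_no_hash_collision(
        string memory a1, string memory b1,
        string memory a2, string memory b2
    ) external pure {
        // Precondition: the two (a, b) pairs differ in at least one component.
        if (
            keccak256(bytes(a1)) == keccak256(bytes(a2)) &&
            keccak256(bytes(b1)) == keccak256(bytes(b2))
        ) return;
        // Property under test: distinct pairs must not produce the same id.
        assert(
            keccak256(abi.encodePacked(a1, b1)) != keccak256(abi.encodePacked(a2, b2))
        );
    }
}
```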
Comparison with Industry Standards
Our findings differ somewhat from other published comparisons of smart contract security tools. For instance, a study by Charingane found MythX (which uses Mythril) to be effective for detecting medium-severity issues, while our testing showed Mythril missing most vulnerabilities.
Similarly, resources like CoinMarketCap's guide and 101Blockchains' review tend to focus on features and capabilities rather than empirical effectiveness. Our hands-on testing revealed significant gaps between marketing claims and actual vulnerability detection.
Most industry reviews also don't compare AI-based tools like ChatGPT with dedicated security analyzers. Our finding that ChatGPT outperformed specialized tools for simple to moderate contracts suggests that the security tool landscape may be evolving more rapidly than reflected in conventional wisdom.
Discussion
Our results reveal significant limitations in current automated security tools for smart contracts:
- Poor Detection Rates: Most dedicated tools missed even basic vulnerabilities in simple contracts. While they might catch certain classes of issues (e.g., reentrancy or unchecked returns), they failed on contract-specific logic flaws.
- Excessive Noise: The static analyzers generated verbose reports filled with false positives, making it difficult to identify genuine issues. This "alert fatigue" can lead to real vulnerabilities being overlooked.
- High Setup Overhead: Fuzzers and symbolic testing tools required significant configuration and guidance to test even simple properties. The effort required often exceeded the benefit gained.
- ChatGPT's Surprising Performance: Despite not being designed specifically for smart contract security, ChatGPT outperformed dedicated tools on simple contracts while providing more concise and actionable reports. This suggests that general reasoning capabilities may be more valuable than specialized heuristics for certain types of analysis.
- Diminishing Returns with Complexity: All tools, including ChatGPT, performed worse as contract complexity increased. For complex systems like KlerosV2, no tool successfully identified the intentionally introduced vulnerabilities.
Practical Recommendations
Based on our evaluation, we recommend the following practices for smart contract security:
- Use Multiple Complementary Tools: No single tool caught all issues. Using several tools in parallel might provide broader coverage, but be prepared for significant noise.
- Consider AI Assistance: ChatGPT or similar AI tools can provide a helpful additional perspective for initial reviews, particularly for simpler contracts, though human expertise is still required for final validation.
- Focus on Contract-Specific Testing: Generic tools struggled with logical flaws specific to a contract's business logic. Invest in comprehensive test suites that verify your specific requirements.
- Use Foundry's Native Fuzzer: For fuzzing needs, Foundry's built-in fuzzer offered a better developer experience than the standalone alternatives in our testing (see the sketch after this list).
- Prioritize Manual Review: Thorough peer review remains the most effective method for detecting vulnerabilities, especially for complex contract systems.
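As an example of the lower setup cost mentioned in the fourth point, a Foundry fuzz test is an ordinary forge-std test whose parameters are fuzzed automatically by `forge test`. The sketch below uses a hypothetical Vault-like contract, not our production code.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import {Test} from "forge-std/Test.sol";

// Hypothetical Vault-like contract used only to illustrate the test shape.
contract VaultLike {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw(uint256 amount) external {
        require(balances[msg.sender] >= amount, "insufficient balance");
        balances[msg.sender] -= amount;
        payable(msg.sender).transfer(amount);
    }
}

contract VaultFuzzTest is Test {
    VaultLike internal vault;

    function setUp() public {
        vault = new VaultLike();
    }

    // Foundry generates random values for the parameters automatically; the
    // property is that nobody can withdraw more than they deposited.
    function testFuzz_cannotWithdrawMoreThanDeposited(uint96 deposited, uint96 amount) public {
        vm.deal(address(this), deposited);
        vault.deposit{value: deposited}();

        if (amount > deposited) {
            vm.expectRevert();
        }
        vault.withdraw(amount);
    }

    // Needed so the test contract can receive ETH back from withdraw().
    receive() external payable {}
}
```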
Conclusion
Our evaluation found significant limitations in current smart contract security tools. Most dedicated tools missed basic vulnerabilities while generating excessive false positives, making them challenging to integrate into development workflows.
ChatGPT's surprisingly strong performance suggests that general reasoning capabilities may be as valuable as specialized security heuristics for certain types of analysis, at least for simpler contracts. However, all tools—automated and AI-based—struggled with complex contract systems.
For now, we recommend a multi-layered approach that combines:
- Thorough manual code review by experienced engineers
- Comprehensive test suites targeting contract-specific logic
- Selective use of automated tools with a critical eye toward their limitations
- Consideration of AI-assisted review as a supplementary measure
We hope this evaluation helps other teams make informed decisions about their security practices. We remain committed to improving our security processes and will continue to evaluate new tools as they emerge.