Documentation

The source code of ConcoLLMic is available on GitHub.

Overview

ConcoLLMic is a language- and theory-agnostic concolic execution engine, powered by a collection of LLM agents. ConcoLLMic performs fuss-free concolic testing across environments and languages, including polyglot systems.

Main features of ConcoLLMic include:
  • Source-code-level instrumentation for any language.
  • Flexible representations (natural language, code, SMT) for path constraints.
  • Readable test cases.
  • Minimized hallucination through real executions, coverage tracking, and tool use.

System Diagram

[Figure: System diagram of ConcoLLMic]

Workflow Illustration: Floating-point theory in C

Phase 1: Instrumentation

Example instrumentation of src/count.c: fprintf statements are inserted at key execution points, and their output on stderr allows reconstruction of the dynamic execution path.

[Figure: Instrumentation of ConcoLLMic]

Phase 2: Concolic Testing

Compared to verbose formulas used by traditional tools, e.g.,

start = atof(I[0]) /\ end = atof(I[1]) /\
start < end /\ cur[0] = start /\ 
(for all 0 ≤ i < cnt, cur[i] != end) /\ 
cur[cnt] == end /\ cnt ≤ 20

ConcoLLMic summarizes the path constraints concisely and solves for an input:

The number of representable FP values in the range must be ≤ 20.

The example workflow for generating such a constraint (and a matching input) is shown below, where the Summarization Agent and the Solving Agent work together.

[Figure: Testing phase of ConcoLLMic]
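
For intuition, here is one way to construct a concrete input satisfying the summarized constraint, using Python's math.nextafter to step through adjacent representable doubles. This is a hypothetical illustration, not output produced by ConcoLLMic:

import math

# Constraint: at most 20 representable doubles lie between start and end.
start = 1.0
end = start
for _ in range(5):  # take 5 representable steps, well within the bound of 20
    end = math.nextafter(end, math.inf)

print(start, end)  # 1.0 1.0000000000000011

Feeding such a start/end pair to count should exercise the path summarized above.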

Reproducing Results from the Paper

For instructions on running the example real-world projects, see GitHub.

User Guide of ConcoLLMic

Installation

git clone https://github.com/ConcoLLMic/ConcoLLMic.git ConcoLLMic && cd ConcoLLMic
pip install -r requirements.txt
pip install -r requirements-dev.txt

export ANTHROPIC_API_KEY="YOUR_KEY_HERE"

For more details on using different models and setting up API keys, see the Model section.

Quick Start

Use the single entrypoint to instrument, run, and replay.

python3 ACE.py [subcommand] [options]
  • instrument: instrument the source code
  • run: run the concolic testing agent
  • replay: replay and examine generated tests

Step 1. Instrument the source code.

For the working examples, instrumented code is already provided in ./code_example/instr/. If you want to re-instrument them (or your own code), run:

python3 ACE.py instrument \
  --src_dir ./code_example/src/ \
  --out_dir ./code_example/instr/ \
  --instr_languages python,java,c,cpp

Step 2. Compile the instrumented code.

Build the instrumented example to get a runnable binary:

gcc -o ./code_example/instr/count ./code_example/instr/count.c

Step 3. Launch the concolic execution agent.

Run the test harness for 2 rounds with 3 parallel queries. It costs ~$0.40 with Claude-3.7.

python3 ACE.py run \
  --project_dir ./code_example/instr/ \
  --execution ./code_example/harness/count.py \
  --out out --rounds 2 --parallel_num 3

This will generate logs ./out/ConcoLLMic_*.log and test cases in ./out/queue/id:*.yaml. Optionally, view testing statistics (e.g., costs, time) with:

python3 ACE.py run_data ./out/

Step 4. Replay generated tests for coverage / debugging.

First, compile the program with gcov enabled and reset coverage counters:

cd ./code_example/src/
gcc --coverage -o count count.c
lcov -z -d .
cd ../../

Then, replay the generated tests to collect coverage data using the provided coverage script:

python3 ACE.py replay ./out/ ./code_example/src/ ./coverage.csv \
  --cov_script ./code_example/src/coverage.sh

This will generate a summary at coverage.csv and detailed per-line coverage at ./code_example/src/count.c.gcov. The gcov file should show that the bug at line 67 has been triggered.

Instrumentation

instrument is the first step of ConcoLLMic. It inserts printing statements (e.g., fprintf in C) at key syntactic points in the source code; each statement prints a tuple of the form ([file_name], "enter"/"exit", function_name, id) to stderr. These tuples let ConcoLLMic understand the structure of any source code, and during execution the stderr output can be used to reconstruct the dynamic execution path.
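
To make this concrete, here is a hypothetical sketch of what an instrumented Python function might look like (the file name, ids, and exact tuple formatting are invented; for C, the inserted statements are fprintf(stderr, ...) calls instead):

import sys

def divide(a, b):
    # Inserted at function entry.
    print('("calc.py", "enter", "divide", 3)', file=sys.stderr)
    if b == 0:
        # Inserted before the early exit through the error branch.
        print('("calc.py", "exit", "divide", 4)', file=sys.stderr)
        raise ZeroDivisionError("b must be non-zero")
    # Inserted before the normal return.
    print('("calc.py", "exit", "divide", 5)', file=sys.stderr)
    return a / b

Collecting these tuples from stderr after a run yields the sequence of enter/exit events, i.e., the dynamic execution path.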

Usage. The usage of instrument is as follows:

    --src_dir: Required. The directory of the source code.
    --out_dir: Required. The directory of the instrumented code.
    --instr_languages: Required. Comma-separated list of languages to instrument (e.g. python,java,c,cpp). All other files in the source directory will be ignored.
    --exclude_dirs: Optional. Comma-separated list of directories to exclude from instrumentation (e.g. deps,tests,examples).
    --parallel_num: Optional. The number of parallel instrumentation jobs. Default is 10.
    --chunk_size: Optional. For large files, the code is instrumented in chunks of this size. Default is 800.

Concolic Testing

run is the second step of ConcoLLMic. It runs the concolic execution agent with the (compiled) instrumented code.

Usage. The usage of run is as follows:

    --project_dir: Required. The directory of the instrumented project.
    input: one of the following is required
      --execution: The initial execution file, specified as either (1) a file path relative to project_dir (NOT the working directory), or (2) an absolute file path. See below for more details.
      --resume_in: The output directory of one previous concolic execution (if you want to resume from previous concolic execution).
    --out: Required. The directory of the output, including the execution files, trace files, and summary files.
    --rounds: Optional. The number of rounds to run.
    --selection: Optional. The selection method for test cases. Choose from random, dfs, llm. Default is random.
    --timeout: Optional. The timeout for each execution of the program under test. Default is 3 seconds.
    --plateau_slot: Optional. The plateau slot in minutes. The concolic execution will stop if the (internal) code coverage does not improve for this period of time. This field is ignored when rounds is specified.
    --parallel_num: Optional. The number of parallel test case generation jobs. Default is 5.

Test Harness. The file specified by --execution is the initial execution harness, written in Python. It is conceptually similar to the initial seed used in conventional concolic execution / fuzzing tools. It should contain a function of the form execute_program(timeout: int) -> tuple[str, int] that executes the program and returns its stderr output and exit code; timeout is the per-execution timeout in seconds. As sketched after this list, the function should run the program under test by

  1. constructing an input (or environment),
  2. passing it to the program (e.g., using subprocess.run), and
  3. handling possible failures.
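
A minimal harness for the count example might look as follows. This is a hypothetical sketch: the binary path and input values are illustrative, and the sentinel exit code for timeouts is an assumption, not part of the documented interface.

import subprocess

def execute_program(timeout: int) -> tuple[str, int]:
    # 1. Construct an input: a fixed floating-point range (illustrative values).
    args = ["./count", "1.0", "2.0"]
    try:
        # 2. Run the program under test; stderr carries the instrumentation
        #    tuples used to reconstruct the execution path.
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
        return proc.stderr, proc.returncode
    except subprocess.TimeoutExpired as exc:
        # 3. Handle failures: return whatever stderr was captured so far,
        #    with -1 as an assumed sentinel exit code.
        stderr = exc.stderr or b""
        if isinstance(stderr, bytes):
            stderr = stderr.decode(errors="replace")
        return stderr, -1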

For concrete toy examples, see the code examples. For more realistic examples, see the harnesses for bc (C), woff2 (C++ with binary input), and ultrajson (Python and C).

Replay

After testing, you can use replay to re-execute the generated test cases on the program under test (possibly recompiled with coverage instrumentation or with sanitizers) to collect coverage data, detect additional bugs, or understand the test cases.

Usage. The usage of replay is as follows:

    --out_dir: Required. The output directory that contains the test cases to replay.
    --project_dir: Required. The directory of the project with coverage instrumentation.
    --output_file: Required. The file to save coverage data.
    --cov_script: Optional. See more details below.
    --timeout: Optional. The timeout for each execution of the program under test. Default is 3 seconds.

Coverage Script. The coverage script is a custom script that is invoked after running each test case. It offers a convenient way to collect (1) overall coverage data (including l_per, l_abs, b_per, b_abs) and (2) the number of times a specific line has been covered. The script should take three arguments: (i) the relative file path, (ii) the line number, and (iii) the line content. For concrete examples, see this script, which collects gcov coverage of the program under test.
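
As an illustration, a coverage script in this style might look as follows. This is a hypothetical sketch: it assumes a gcc --coverage build with .gcda/.gcno files in the current directory, it only extracts the per-line hit count (the script in the repository also reports the overall l_per, l_abs, b_per, and b_abs figures), and the output format is illustrative rather than the format expected by ACE.py.

import os
import subprocess
import sys

def main() -> None:
    # Invoked as: coverage.py <relative_file_path> <line_number> <line_content>.
    rel_path, line_no = sys.argv[1], sys.argv[2]  # sys.argv[3] (line content) unused here

    # Regenerate the .gcov report; gcov writes <basename>.gcov into the cwd.
    subprocess.run(["gcov", rel_path], capture_output=True, check=True)

    # Each .gcov line has the form "<count>:<line>:<source>", where <count> is
    # "-" for non-executable lines and "#####" for lines never executed.
    hits = 0
    with open(os.path.basename(rel_path) + ".gcov") as report:
        for row in report:
            parts = row.split(":", 2)
            if len(parts) < 3:
                continue
            count, num = parts[0].strip(), parts[1].strip()
            if num == line_no and count.rstrip("*").isdigit():
                hits = int(count.rstrip("*"))
    print(f"{rel_path}:{line_no} covered {hits} time(s)")

if __name__ == "__main__":
    main()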

Statistics

After instrumentation or testing, you can use instrument_data and run_data to collect statistics. Reported statistics include the breakdown of test cases, and the time/cost expended.

Usage. The usage of instrument_data is as follows:

    directory: Required. The directory to search for instrumentation data.
    --extensions: Optional. List of file extensions to search for.
    --output: Optional. Output CSV file path for saving instrumentation data.

Usage. The usage of run_data is as follows:

    out_dir: Required. The directory to search for cost statistics.
    --print-tokens: Optional. Whether to print per-test-case token usage. Default is True.

Models

This section summarizes the models used by ConcoLLMic and how to configure them.

  • By default, ConcoLLMic uses claude-3-7-sonnet (requires ANTHROPIC_API_KEY). See here for how to set up an Anthropic API key.
  • By default, ConcoLLMic uses temperature 0.0 for instrumentation and 0.5 for all other agents. You may adjust the temperature by changing the constant *_TEMPERATURE in app/agents/agent_*.py.
  • To use other models, set the corresponding provider's API key. For example, to use GPT models, OPENAI_API_KEY must be set. See here for how to set up an OpenAI API key.

Docker

A Dockerfile is available on GitHub.