JMC: Java Performance Profiling Simplified
In this blog post, I explore the capabilities of JDK Mission Control (JMC), a powerful tool for low-overhead performance analysis and diagnostics of Java applications.
If you’ve read my previous post, you’ll know that I have been using DeepSeek extensively. However, with its increasing popularity, I have noticed a degradation in service performance, as reflected on the DeepSeek status page. To mitigate this, I have switched to Perplexity Pro, a complimentary service offered to Singtel customers. For those without access to it, an alternative is Google AI Studio. Having an AI pair programmer significantly enhances the troubleshooting process.
Setting up JMC
To begin, download and install JMC from the official JMC 9.0.0 downloads page.
As part of my ongoing work, I aimed to optimize the inference performance of Micronaut-Llama3 to support Unsloth’s DeepSeek-R1-Distill-Llama-8B. Since DeepSeek-R1 only supports Q4_0 and Q8_0 quantization for the Llama architecture, I opted for the Q8_0 model.
To integrate support for this model, I made the following modifications:
Changes to micronaut/model/ChatFormat.java:
public ChatFormat(Tokenizer tokenizer) {
    this.tokenizer = tokenizer;
    Map<String, Integer> specialTokens = this.tokenizer.getSpecialTokens();
    specialTokens.putIfAbsent("<|begin_of_text|>", 128000); // for DeepSeek-R1
    specialTokens.putIfAbsent("<|end_of_text|>", 128001); // for DeepSeek-R1
    this.beginOfText = getRequiredToken(specialTokens, "<|begin_of_text|>");
    this.startHeader = getRequiredToken(specialTokens, "<|start_header_id|>");
    this.endHeader = getRequiredToken(specialTokens, "<|end_header_id|>");
    this.endOfTurn = getRequiredToken(specialTokens, "<|eot_id|>");
    this.endOfText = getRequiredToken(specialTokens, "<|end_of_text|>");
    this.endOfMessage = specialTokens.getOrDefault("<|eom_id|>", -1); // only in 3.1
    this.stopTokens = Set.of(endOfText, endOfTurn);
}
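The constructor relies on a getRequiredToken helper that isn’t shown above. A minimal sketch of what such a helper might look like (the actual implementation in the project may differ):

private static int getRequiredToken(Map<String, Integer> specialTokens, String token) {
    // Hypothetical helper: look up a special token and fail fast if it is missing
    Integer id = specialTokens.get(token);
    if (id == null) {
        throw new IllegalStateException("Missing required special token: " + token);
    }
    return id;
}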
Changes to micronaut/model/Tokenizer.java:
public Tokenizer(Vocabulary vocabulary, List<Pair<Integer, Integer>> merges, String regexPattern,
                 Map<String, Integer> specialTokens) {
    specialTokens.putIfAbsent("<|begin_of_text|>", 128000); // for DeepSeek-R1
    specialTokens.putIfAbsent("<|end_of_text|>", 128001); // for DeepSeek-R1
    this.vocabulary = vocabulary;
    this.compiledPattern = regexPattern != null ? Pattern.compile(regexPattern) : null;
    this.specialTokens = new HashMap<>(specialTokens);
    this.merges = new HashMap<>();
    for (Pair<Integer, Integer> pair : merges) {
        int firstIndex = pair.first();
        int secondIndex = pair.second();
        int mergeIndex = vocabulary.getIndex(vocabulary.get(firstIndex) + vocabulary.get(secondIndex))
                .orElseThrow();
        this.merges.put(pair, mergeIndex);
    }
}
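The merges parameter uses a Pair type with first() and second() accessors, which suggests a simple record along these lines (assumed shape; the project’s actual definition may differ):

// Assumed shape of the Pair type used for BPE merges
record Pair<T, U>(T first, U second) {
}

With this shape, merges.get(new Pair<>(a, b)) returns the vocabulary index of the merged token, which is exactly what the loop above precomputes.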
...
public String decode(List<Integer> tokens) {
    String decoded = decodeImpl(tokens);
    // Replace the original decodedBytesAsInts with the below
    int[] decodedBytesAsInts = decoded.codePoints()
            .map(cp -> {
                Integer decodedByte = BYTE_DECODER.get(cp);
                if (decodedByte == null) {
                    // Fall back to '?' for code points with no BYTE_DECODER mapping
                    return (int) '?';
                }
                return decodedByte;
            })
            .toArray();
    byte[] rawBytes = new byte[decodedBytesAsInts.length];
    for (int i = 0; i < decodedBytesAsInts.length; i++) {
        rawBytes[i] = (byte) decodedBytesAsInts[i];
    }
    return new String(rawBytes, StandardCharsets.UTF_8);
}
Profiling with JMC
To start profiling, simply start a flight recording from JMC, as shown below:
Referencing the Llama3.java post, I ran the application using:
gradlew run
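Alternatively, a recording can be started without the JMC UI, either with jcmd against the running JVM or via a JVM flag at startup (the recording name, duration, and file name below are just placeholders):

# Start a time-bounded recording on the running JVM (replace <pid> with the application's PID)
jcmd <pid> JFR.start name=llama3 settings=profile duration=120s filename=llama3.jfr

# Or start a recording at JVM startup with a flag
java -XX:StartFlightRecording=duration=120s,filename=llama3.jfr ...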
Configuration in application.properties:
micronaut.application.name=llama3
micronaut.server.port=8888
llama.BatchSize=32
llama.VectorBitSize=512
llama.PreloadGGUF=DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf
options.model_path=DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf
options.temperature=0.1f
options.topp=0.95f
options.seed=42
options.max_tokens=512
options.stream=true
options.echo=true
options.fullResponseStream=true
Test URL:
http://localhost:8888/api/llama3/generate?prompt=Why%20is%20the%20sky%20blue?
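With options.stream=true, the response arrives incrementally; a quick way to watch the tokens stream in is curl with output buffering disabled (-N):

curl -N "http://localhost:8888/api/llama3/generate?prompt=Why%20is%20the%20sky%20blue?"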
The profiling results are as follows:
Performance Optimization: ByteVector Operations
The flame graph analysis highlighted ByteVector operations as an optimization opportunity:
Optimized code snippet (before and after):
// Before: separate operations
ByteVector loBytes = wBytes.and(MASK_LOW).sub(OFFSET_8);
ByteVector hiBytes = wBytes.lanewise(VectorOperators.LSHR, 4).sub(OFFSET_8);

// After: combined operations
ByteVector loBytes = wBytes.and(MASK_LOW);
ByteVector hiBytes = wBytes.lanewise(VectorOperators.LSHR, 4).and(MASK_LOW);
ByteVector combined = loBytes.blend(hiBytes.lanewise(VectorOperators.LSHL, 4), BLEND_MASK);
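The snippet references a few constants that are not shown. Assuming a 256-bit byte species, they might be declared roughly as follows; the names come from the snippet, but the species, zero-point, and mask value here are my guesses:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Guessed definitions of the constants used above; the project's actual values may differ
final class VectorConstants {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_256;
    // Keeps only the low 4 bits of each packed byte
    static final ByteVector MASK_LOW = ByteVector.broadcast(SPECIES, (byte) 0x0F);
    // Quantization zero-point subtracted from each nibble
    static final ByteVector OFFSET_8 = ByteVector.broadcast(SPECIES, (byte) 8);
    // Alternating-lane mask selecting which lanes take the shifted high nibbles
    static final VectorMask<Byte> BLEND_MASK = VectorMask.fromLong(SPECIES, 0xAAAAAAAAL);
}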
JMC vs. VisualVM: A Comparative Analysis
JMC offers a more sophisticated and efficient profiling experience, making it a preferred tool for optimizing Java applications at scale.
| Feature | VisualVM | JDK Mission Control (JMC) |
|---|---|---|
| Ease of Use | Simple, user-friendly | Advanced, steeper learning curve |
| Performance Overhead | Higher | Lower |
| Flame Graphs | Requires plugins | Built-in |
| Data Granularity | Basic monitoring data | Detailed, in-depth insights |
| Best Use Case | General debugging & profiling | Low-overhead, enterprise-grade profiling |
| Data Collection Method | JMX | Java Flight Recorder (JFR) |
By leveraging JMC, I was able to identify and optimize key performance bottlenecks in my project. If you’re working with Java applications that require in-depth profiling, JMC is a must-have tool.
Stay tuned for more insights on optimizing Java applications!