Why does the response take so long? #108
-
Hi @tjake! Thanks so much for your work on this project. I love it! I'm just trying it out for the first time, and it's really slow. Is it something I'm doing wrong? Running locally on my MacBook Pro, responses take over 60 seconds; deployed to production, they take 20-30 seconds. Here's my adapter:

```java
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.jlama.JlamaStreamingChatModel;
import dev.langchain4j.service.AiServices;
import reactor.core.publisher.Flux;

// AiGateway, StreamingAssistant, and UserMessageDto are my own application types.
public class LangchainAiAdapter implements AiGateway {

    private final StreamingAssistant streamingAssistant;

    public LangchainAiAdapter() {
        // Jlama-backed streaming model, loaded by Hugging Face model name
        StreamingChatLanguageModel model = JlamaStreamingChatModel.builder()
                .modelName("tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4")
                .temperature(0.3f)
                .build();

        this.streamingAssistant = AiServices.builder(StreamingAssistant.class)
                .streamingChatLanguageModel(model)
                .systemMessageProvider(memoryId -> "system message")
                // Keep the last 10 messages per chat as conversation context
                .chatMemoryProvider(memoryId -> MessageWindowChatMemory.withMaxMessages(10))
                .build();
    }

    @Override
    public Flux<String> sendMessageAndReceiveReplies(UserMessageDto userMessageDto) {
        return streamingAssistant.chat(userMessageDto.chatId(), userMessageDto.userMessageText());
    }
}
```
-
Hi @LiveNathan, are you including the native bindings? You need to add them explicitly: https://github.com/langchain4j/langchain4j-examples/blob/main/jlama-examples/pom.xml#L21C1-L34C1. I don't have an x86 Mac, but I built the bindings and can see they work: https://github.com/tjake/Jlama/actions/runs/11677523417/job/32515542812#step:4:11670. So check whether that's the issue. Also, which JVM are you using?
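For reference, here is a minimal sketch of what the linked pom.xml sets up: the core artifact plus the platform-specific native bindings. The `${jlama.version}` property is a placeholder, and the `${os.detected.*}` classifier assumes the `kr.motd.maven:os-maven-plugin` build extension is configured; check the linked example for the exact coordinates.

```xml
<!-- Sketch only: explicit Jlama native bindings alongside the core artifact.
     ${jlama.version} is a placeholder; see the linked pom.xml for exact values. -->
<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-core</artifactId>
    <version>${jlama.version}</version>
</dependency>
<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-native</artifactId>
    <!-- Classifier resolved by os-maven-plugin, e.g. osx-x86_64 or linux-x86_64 -->
    <classifier>${os.detected.name}-${os.detected.arch}</classifier>
    <version>${jlama.version}</version>
</dependency>
```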
-
Ohhh, you're using the old version of Jlama that was bundled with langchain4j. You need to exclude that version and use the latest (sorry, this is the downside of bundling with langchain4j).
As for the JDK, use …
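A hedged sketch of the exclusion this suggests — the artifact IDs below are assumptions based on the usual coordinates (`dev.langchain4j:langchain4j-jlama` pulling in `com.github.tjake:jlama-core`), so verify them against your own dependency tree (`mvn dependency:tree`) before copying:

```xml
<!-- Sketch only: keep the langchain4j integration but swap in a newer Jlama.
     Artifact IDs are assumed; verify with `mvn dependency:tree`. -->
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-jlama</artifactId>
    <version>${langchain4j.version}</version>
    <exclusions>
        <exclusion>
            <groupId>com.github.tjake</groupId>
            <artifactId>jlama-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Then declare the latest jlama-core (and jlama-native) explicitly,
     as in the dependency snippet in the reply above. -->
<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-core</artifactId>
    <version>${latest.jlama.version}</version>
</dependency>
```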