Hello,

I am announcing a project that I have been working on since 2023.

Jlama is a Java-based inference engine for many text-to-text models on Hugging Face:

Llama 3+, Gemma 2, Qwen2, Mistral, Mixtral, etc.

It is intended to be used for integrating generative AI into Java apps.

I presented it at Devoxx a couple of weeks back, demoing basic chat, function calling, and distributed inference. Jlama uses the Panama Vector API for fast inference on CPUs, so it works well for small models. Larger models can run in distributed mode, which shards the model by layer and/or attention head.
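
For plain Java usage, the core API looks roughly like this. This is a sketch adapted from Jlama's README; treat the class names (Downloader, ModelSupport, PromptContext) and the generate signature as assumptions to verify against the current docs:

```java
import java.io.File;
import java.util.UUID;

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;
import com.github.tjake.jlama.util.Downloader;

public class JlamaExample {
    public static void main(String[] args) throws Exception {
        // Download the model from Hugging Face (or reuse the local copy)
        File localModelPath = new Downloader("./models", "tjake/Llama-3.2-1B-Instruct-JQ4")
                .huggingFaceModel();

        // Load the model; the DType args control the working memory types
        AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

        // Use the model's chat template if it has one, otherwise a raw prompt
        String prompt = "What is the best season to plant avocados?";
        PromptContext ctx = m.promptSupport().isPresent()
                ? m.promptSupport().get().builder().addUserMessage(prompt).build()
                : PromptContext.of(prompt);

        // Generate up to 256 tokens deterministically (temperature 0.0)
        Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> {});
        System.out.println(r.responseText);
    }
}
```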

It is integrated with LangChain4j and includes an OpenAI-compatible REST API.
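
Through LangChain4j it plugs in as a regular chat model. A minimal sketch, assuming the langchain4j-jlama module and its JlamaChatModel builder (names and options may differ by version):

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.jlama.JlamaChatModel;

public class JokeExample {
    public static void main(String[] args) {
        // Chat model backed by Jlama running in-process (no external server)
        ChatLanguageModel model = JlamaChatModel.builder()
                .modelName("tjake/Llama-3.2-1B-Instruct-JQ4")
                .temperature(0.3f)
                .build();

        System.out.println(model.generate("Tell me a joke about Java."));
    }
}
```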

It supports Q4_0 and Q8_0 quantization and uses models in the safetensors format. Pre-quantized models are maintained on my Hugging Face page, though you can quantize models locally with the Jlama CLI.

It is very easy to install and works great on Linux/Mac/Windows:

# Install jbang (or see https://www.jbang.dev/download/)
curl -Ls https://sh.jbang.dev/ | bash -s - app setup

# Install the Jlama CLI
jbang app install --force jlama@tjake

# Run the OpenAI-compatible chat API and web UI on a model
jlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download
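
Once the restapi command above is running, any OpenAI-style client can talk to it. A sketch using Java's built-in HttpClient; the localhost:8080 address and the /chat/completions path are assumptions here, so check the server's startup output for the actual ones:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestApiExample {
    public static void main(String[] args) throws Exception {
        // OpenAI-style chat completion request against the local Jlama server
        String body = """
                {"messages": [{"role": "user", "content": "Say hello in one sentence."}]}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON containing the model's reply
    }
}
```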

Thanks!