Comparing LLM results on the same prompt across 3 models - Oooooollama
In my journey to learn how to leverage LLMs for the benefit of my wife's Etsy shop, I decided to run some models locally via Ollama. The advantage of Ollama is that one can locally run open source models like Gemma3, Mistral, or even some of the Phi series (which in a few months will most probably be obsolete 😖). Moreover, one can choose to run the full model or a smaller variant (in terms of parameter count). This post deals with testing the same prompt over 3 different models as a preliminary POC for the final project.
[AI generated image]
Hardware, the bane of local LLM serving
Hardware setup:
I RAG, you RAG, we all RAG!
Retrieval-augmented generation is a technique for enhancing the accuracy and reliability of generative AI models with information from specific and relevant data sources.
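In practice, for my case, the idea boils down to: fetch the relevant product text myself and stuff it into the prompt before the model ever sees the question. The sketch below illustrates that flow; the `./products` folder layout and the naive keyword matching are assumptions for illustration, not my actual setup:

```python
# Minimal sketch of the RAG idea: look up the relevant product text
# and prepend it to the question before it reaches the model.
# Assumptions: listings are plain-text files under ./products and
# "retrieval" is just naive keyword matching.
from pathlib import Path

def retrieve(query: str, folder: str = "products") -> str:
    """Return the contents of every listing file that mentions the query."""
    hits = []
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        if query.lower() in text.lower():
            hits.append(text)
    return "\n\n".join(hits)

def augment(question: str, product: str) -> str:
    """Build the final prompt: retrieved shop data first, then the question."""
    context = retrieve(product)
    return f"Use the following shop data to answer.\n\n{context}\n\nQuestion: {question}"

print(augment("Get me the description for Pineapple Opal.", "pineapple opal"))
```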
Data
The data in each file looks like this:
In fact, in the shop, the description includes all of that text and more (this is just a sample) and all I want to work on is the paragraph labeled Description.
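Since only the Description paragraph matters, the first pre-processing step is to cut everything else away. The snippet below is a sketch of that step, assuming (hypothetically) that each exported listing has a line labeled "Description" followed by the paragraph I care about:

```python
# Sketch: pull out just the "Description" paragraph from a listing export.
# The file layout here is an assumption, not the exact Etsy export format.
import re

def extract_description(listing_text: str) -> str:
    """Return only the paragraph that follows a 'Description' label."""
    match = re.search(
        r"Description\s*:?\s*\n?(.*?)(?:\n\s*\n|\Z)",  # stop at the next blank line
        listing_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return match.group(1).strip() if match else ""
```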
Results may vary (and expectations may be shattered)
The role assigned to all three models was: "You are the owner of an Etsy shop and you are trying to make your product descriptions better. Provide any answers in well formed text."

The prompts I tested were:
- Get me the description for Pineapple Opal.
- Get me the description of Pineapple Opal as is
And this had a very interesting and different outcome in each of the three models (a sketch of how to reproduce the side-by-side run follows the list):
- Gemma3 27b
- Gemma3 1b
- Mistral
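For anyone wanting to try this at home, the comparison boils down to sending the same role and question to each model tag pulled via Ollama. The snippet below is a rough sketch against Ollama's local chat endpoint, not my exact harness:

```python
# Run the same system prompt and question against the three local models.
# Sketch only: the model tags assume the variants pulled via `ollama pull`.
import requests

MODELS = ["gemma3:27b", "gemma3:1b", "mistral"]
SYSTEM = ("You are the owner of an Etsy shop and you are trying to make your "
          "product descriptions better. Provide any answers in well formed text.")
QUESTION = "Get me the description of Pineapple Opal as is"

for model in MODELS:
    reply = requests.post(
        "http://localhost:11434/api/chat",   # Ollama's local chat endpoint
        json={
            "model": model,
            "stream": False,                 # return one complete JSON response
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": QUESTION},
            ],
        },
        timeout=600,
    )
    reply.raise_for_status()
    print(f"--- {model} ---")
    print(reply.json()["message"]["content"])
```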
Enhance my data
- Gemma3 27b
The model produced a very accurate, formal and professional description, as if the person wearing the pineapple opal were heading to a professional conference in the Bahamas.
- Gemma3 1b
The model produced a very accurate, informal and youthful description, as if the person wearing the pineapple opal were off on a summer vacation in the Bahamas (which is what I wanted to see).
- Mistral
The model behaved like Gemma3 27b but kept insisting on a white cross opal, even though the pineapple opal was what it had been fed 🤦.
Conclusion of the POC and lesson learned
It is my understanding that the first result, with the data coming back processed, follows from the assigned role, which tells the model to "Provide any answers in well formed text". However, the "as is" result was mildly baffling at first.
Why would a 1b model do as it was told while a 27b model would not? And then it hit me: RAG is all about providing more reference material, building results according to the sentiment, style and character of the question and the role. It is not meant to return data as is. In fact, RAG is not even the right tool for my needs. If I had a large knowledge base it would have made sense; RAG is the perfect tool for enriching LLM responses with specific data the LLM has no prior knowledge of (i.e. data it was not trained on). On a product-by-product basis, however, it is pointless. Each product should have its own query in a chat box.
As for the different sentiments and styles, it occurred to me that in the final app I will create for my wife, I should give her a way to switch models and play with the results until she is satisfied.
Moreover, the prompts should be preset in a list (maybe via MCP?), but with the ability to edit them per request, or even to enhance them and add them to the list of pre-created prompts.