Comparing LLM results on the same prompt between 3 models - Oooooollama

In my journey to learn how to leverage LLMs for the benefit of my wife's Etsy shop, I decided to run some models locally via Ollama. The advantage of Ollama is that one can locally run open source models like Gemma3 or Mistral or even some of the Phi series (which in a few months will most probably be obsolete 😖). Moreover, one can choose to run the full model or some smaller flavor of it (as far as parameter counts go). This post deals with testing the same prompt over 3 different models as a preliminary POC for the final project.


AI generated image

Hardware, the bane of local LLM serving

We have all heard or read about the hardware needed to run a language model of any size locally: one or more capable GPUs and decent supporting hardware all around. Unfortunately I have none of that at home. My private development playground is old (because I like recycling older machines, but that is another story).

Hardware setup:

CPU: Intel i7, 4th Gen
GPU: (1) Nvidia GTX 980
Storage: 250GB SSD
RAM: 16GB

The setup above is hardly ideal, but I gave it a go with Gemma3 27b, which needs one GPU (just not the one I have).

One thing that caught my ear was the fans, which started working hard. It took a while to get an answer back, but eventually it came, and the results were astonishing and disappointing at the same time. Or were they?

I RAG, you RAG, we all RAG!

Anyone who has spent "5 minutes" reading about LLMs and LLM-based applications has read about Retrieval Augmented Generation, but just in case you haven't:

Retrieval-augmented generation is a technique for enhancing the accuracy and reliability of generative AI models with information from specific and relevant data sources.

I downloaded all the shop data as a CSV file, broke the CSV lines into individual text files containing only the fields I needed, gave each file a unique name based on a simple hash of its contents, and fed those files, as embeddings, into chromadb. I then supplied the chosen model with data from the db as additional context, in order to augment the results of my query with accurate, relevant and up-to-date data. I used Langchain to achieve this, by changing the relevant bits of an example I found online.
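For the curious, here is a minimal sketch of that ingestion step. It is not my exact code: the CSV column names ("TITLE", "DESCRIPTION"), the file paths and the embedding model are assumptions for illustration, and it assumes the langchain-community, langchain-ollama and chromadb packages are installed.

```python
# Minimal ingestion sketch: CSV rows -> small text files named by a content
# hash -> embedded and stored in a local Chroma collection.
import csv
import hashlib
from pathlib import Path

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

out_dir = Path("listings")
out_dir.mkdir(exist_ok=True)

docs = []
with open("etsy_listings.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Keep only the fields I care about (column names are hypothetical).
        text = f"{row['TITLE']}\n\n{row['DESCRIPTION']}"
        # Unique file name based on a simple hash of the file contents.
        name = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
        (out_dir / f"{name}.txt").write_text(text, encoding="utf-8")
        docs.append(Document(page_content=text, metadata={"source": name}))

# Embed everything and persist it so the query side can reuse the collection.
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    persist_directory="chroma_db",
)
```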

Data

The data in each file looks like this: 


In the actual shop, the description includes all of that text and more (this is just a sample), and all I want to work on is the paragraph labeled Description.

Results may vary (and expectations may be shattered)

I used 3 different models (Gemma3 27b, Gemma3 1b and Mistral) with the same role prompt:
You are the owner of an Etsy shop and you are trying to make your product descriptions better. Provide any answers in well formed text.

After the role above was set, my question/request was:
Get me the description for Pineapple Opal.
This is one of the product listings in my wife's shop. What I got back was a beautified (well formed), processed, even somewhat improved version of the original listing. I wanted the original listing, so I asked for it:
Get me the description of Pineapple Opal as is
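In case you are wondering what the round trip looks like, here is a minimal sketch building on the ingestion sketch above: the stored listing text is pulled back out of Chroma and handed to each of the three models together with the role and the question. The ollama Python client, the model tags and the retrieval details are assumptions for illustration, not my exact code.

```python
# Minimal query sketch: pull the relevant listing back out of Chroma and ask
# each model the same question, with the retrieved text supplied as context.
import ollama
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

ROLE = (
    "You are the owner of an Etsy shop and you are trying to make your "
    "product descriptions better. Provide any answers in well formed text."
)
QUESTION = "Get me the description of Pineapple Opal as is"

# Reopen the collection persisted by the ingestion step.
vectordb = Chroma(
    persist_directory="chroma_db",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)
retrieved = vectordb.similarity_search(QUESTION, k=2)
context = "\n\n".join(doc.page_content for doc in retrieved)

for model in ("gemma3:27b", "gemma3:1b", "mistral"):
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": ROLE},
            {"role": "user", "content": f"Context:\n{context}\n\n{QUESTION}"},
        ],
    )
    print(f"--- {model} ---")
    print(response["message"]["content"])
```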

And this had a very interesting and different outcome in each of the three models:

  • Gemma3 27b

The model brought the description from the "Details" paragraph onward.

  • Gemma3 1b

The model brought the whole description as is (all paragraphs).

  • Mistral

The model brought the description of another opal listing, which is interesting: it seems it "focused" on the material alone (opal) rather than on the shape and the material together (pineapple opal).

Enhance my data

Once I understood the error of my ways, the next part was easy. In a chat interface I provided the role and the product description, and asked instead for an SEO optimized description with an air of summertime (a rough sketch of that chat call follows the list below). The results were astonishing:
  • Gemma3 27b

The model produced a very accurate, formal and professional description, as if the person wearing the pineapple opal were going to a professional conference in the Bahamas.

  • Gemma3 1b 

The model produced a very accurate, informal and youthful description, as if the person wearing the pineapple opal were going on a summer vacation in the Bahamas (which is what I wanted to see).

  • Mistral

The model behaved like Gemma3 27b, but kept insisting on a white cross opal although the pineapple opal was the one fed to it 🤦.
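As promised above, here is a rough sketch of that chat call. The exact wording of the request and the placeholder description are illustrative; only the role is the one quoted earlier.

```python
# Minimal sketch of the "enhance" step: role + original description + a
# request for an SEO optimized, summery rewrite, all in one chat call.
import ollama

ROLE = (
    "You are the owner of an Etsy shop and you are trying to make your "
    "product descriptions better. Provide any answers in well formed text."
)

# Placeholder: paste the original listing text here.
original_description = "<original Pineapple Opal description>"

request = (
    "Here is the current product description:\n\n"
    f"{original_description}\n\n"
    "Rewrite it as an SEO optimized description with an air of summertime."
)

response = ollama.chat(
    model="gemma3:1b",  # swap the tag for gemma3:27b or mistral to compare
    messages=[
        {"role": "system", "content": ROLE},
        {"role": "user", "content": request},
    ],
)
print(response["message"]["content"])
```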

Conclusion of the POC and lesson learned

It is my understanding that the data initially coming back processed is a result of the role I assigned, specifically the "Provide any answers in well formed text" instruction. However, the "as is" part was mildly baffling at first.

Why would a 1b model do as it was told while a 27b model would not? And then it hit me. RAG is all about providing more reference material and building results according to the sentiment, style and character of the question and the role. It is not meant to return data as is. In fact, RAG is not even the right tool for my needs. If I had a large knowledge base it would have made sense: RAG is the perfect tool to enrich LLM responses with specific data the LLM has no prior knowledge of (i.e. data it was not trained on). On a product-by-product basis, however, there is no point. Each product should have its own query in a chat box.

As for the different sentiments and styles, it occurred to me that in the final app I will create for my wife, I should provide her with a means to switch models and play with the results to her satisfaction.

Moreover, the prompts must be preset in a list (maybe via MCP?), but with the ability to edit them per request, or even to enhance them and add them to the list of pre-created prompts.
