Here are the surprising results:

We asked 5 language models what to buy for Black Friday

27.11.2024, Thomas Walter

A colleague working behind his computer and supportive screen

Have you ever used ChatGPT to research or select products? According to Statista, 10% of office workers already use ChatGPT multiple times a day. With 3.7 billion monthly visitors worldwide, ChatGPT’s reach and influence as a single tool are growing almost daily.

As large language models (LLMs) become increasingly relevant in our everyday lives, it’s not a bold claim to say that they are fundamentally transforming how we discover, evaluate, and purchase products. And yet, this topic has barely received any attention.

While search engines, SEO, and SEA have long been the gold standard of product marketing, language models like OpenAI’s ChatGPT, Microsoft Copilot, and Google Gemini are emerging as new, personalized advisors for millions of consumers. But how do these models decide which products and brands to recommend? There won’t be a clear answer — by design, LLMs are too much of a black box. But hey, it’s still worth giving it a hands-on try, right?

Our experiment delivers surprising answers—and shows why brands urgently need to act to remain relevant in this new era of shopping.

 

The Experiment: LLMs as Personal Black Friday Shopping Advisors

We asked five of the leading language models – ChatGPT 4o, Claude 3.5 by Anthropic, Microsoft Copilot, Google Gemini, and Perplexity – what they would recommend for Black Friday. Over the course of seven days, all five were given the same task (see the screenshot below). For each of their recommendations, we documented the recommended brand, product, product type, price, the LLM’s remarks, and the cited sources. We invite you to pause after this next question and note your own picks – this is your chance to beat the machine. The following question was asked:

“Hey, Black Friday is coming up in a week and I have a budget of 800 EUR in total to spend. Can you please help me picking 5 products from 5 different categories. 1 electronics item, 1 fashion item, 1 home appliances item, 1 cosmetics item, 1 toy for children age 5-10. I need to stay in my overall budget but am looking for the best brands and good deals. So, what products can you recommend for me in each category and why? Thanks!”

 

The results were surprising: they not only reveal which brands and products were preferred but also provide fascinating insights into how LLMs function — from brand preferences and trending product categories to amusing calculation errors.

While these findings are merely a snapshot from the week leading up to Black Friday 2024, they paint a clearer picture of how LLMs prioritize purchasing decisions. More importantly, they highlight the challenges and opportunities this technology presents for brands.


In a nutshell: 7 days, 5 language models, 5 shopping categories—that adds up to 175 brand and product recommendations, complete with prices and source citations. It’s not enough to qualify as a full-fledged market study, but it’s a compelling snapshot of the week before Black Friday 2024. If you redo the experiment, your results may differ slightly, and it also matters whether you have a pro account: while we were running the experiment, Perplexity launched integrated shopping as a pro feature. Our experiment doesn’t claim scientific rigor; it serves as a snapshot that illustrates how LLMs can influence purchasing decisions—and what challenges and opportunities this technology holds for brands.
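If you’d like to rerun the experiment more systematically than our manual copy-and-paste routine, a daily prompt-and-log loop could look roughly like the Python sketch below. The model list, the `query_model` placeholder, and the CSV layout are illustrative assumptions, not the exact tooling we used.

```python
import csv
from datetime import date

# Illustrative sketch only: we ran the experiment manually in each chat UI.
# query_model is a placeholder for whatever vendor API (or manual step) you use.

MODELS = ["ChatGPT 4o", "Claude 3.5", "Microsoft Copilot", "Google Gemini", "Perplexity"]

PROMPT = (
    "Hey, Black Friday is coming up in a week and I have a budget of 800 EUR in total "
    "to spend. Can you please help me picking 5 products from 5 different categories. "
    "1 electronics item, 1 fashion item, 1 home appliances item, 1 cosmetics item, "
    "1 toy for children age 5-10. I need to stay in my overall budget but am looking "
    "for the best brands and good deals. So, what products can you recommend for me "
    "in each category and why? Thanks!"
)

def query_model(model: str, prompt: str) -> str:
    """Placeholder: send the prompt to the given model and return its full answer."""
    raise NotImplementedError("Wire this up to the vendor API, or paste answers manually.")

def log_answer(model: str, answer: str, path: str = "recommendations.csv") -> None:
    """Append the raw answer; brand, product, price and sources are extracted later."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([date.today().isoformat(), model, answer])

if __name__ == "__main__":
    for model in MODELS:  # run once per day over the seven days
        log_answer(model, query_model(model, PROMPT))
```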

 

“And the LLM-Brand Recognition Award goes to…”

💻 Electronics: Apple and Samsung Lead the Pack

The electronics category was dominated by two powerhouse brands: Apple (40% of all possible mentions) and Samsung (26%). Together, they received most of the attention, while Sony and Panasonic appeared sporadically. Other strong contenders, such as Bang & Olufsen—a name synonymous with premium audio—were entirely absent. This is particularly surprising since the Danish company performs exceptionally well in traditional SEO-driven searches.

The most dominant product category? Headphones! Apple’s AirPods (2nd Gen) and Sony’s WH-1000XM4 emerged as the LLMs’ hot Black Friday picks. The hypothesis here is that language models heavily prioritize bestsellers, showing little interest in innovation or exploration.

👜 Fashion: Levi’s Surprises, Nike Is Invisible

Levi’s unexpectedly emerged as the clear winner in the fashion category—particularly with its iconic Levi’s 501 Original Jeans, which were recommended multiple times. Honestly, I hadn’t thought about a 501 since I still fit in them in the late ‘90s, so that was a personal surprise for me. Dr. Martens and Adidas also received mentions, though far fewer.

What’s particularly striking: major fashion labels like Gucci and Prada, as well as the sportswear titan Nike, were entirely absent. In a traditional Google search, this would be unthinkable. This suggests that even global giants can struggle to achieve visibility in the recommendation algorithms of language models.

A curious detail: Levi’s wasn’t confined to fashion alone. It also appeared in the electronics category—with a phone case. A small but telling hint of how deeply the brand is ingrained in the minds of these models.

🔌 Home Appliances: AI Can’t Eat but Loves Air Fryers

The home appliances category displayed greater variety, but Dyson stood out as one of the winners. From vacuums to hair dryers, Dyson managed to score mentions across multiple categories. The dominant product category, however, was the air fryer, with Philips and Ninja leading the charge.

It’s worth noting that language models showed little innovation here as well. Nearly all recommendations focused on established bestsellers, while newer technologies, particularly smart home devices, were almost entirely overlooked.

🌸 Cosmetics: Clinique and Estée Lauder Dominate

The cosmetics category offered diversity, but Clinique (23%) and Estée Lauder (26%) emerged as the clear frontrunners. Other brands, such as Maybelline and Charlotte Tilbury, also received mentions. However, one big surprise was the total absence of industry giants like Nivea. As one of the most recognized skincare brands worldwide, and an even more dominant one in Germany (where the experiment originated), Nivea’s omission suggests that LLMs do not yet automatically prioritize either global names or local preferences.

Another interesting detail: Dyson’s hair dryer also appeared in the cosmetics category—another hint that brands can successfully position their products across multiple categories.

🧸 Toys: Lego, Lego, Lego—and Nothing Else!

If this experiment were to crown a single brand as the ultimate winner, it would undoubtedly be Lego. In the toys category, Lego set an unparalleled example of brand dominance. With 35 out of 35 possible mentions, no other toy brand—be it Playmobil, Fisher-Price, or Schleich—was even considered by the five language models. The models’ total brand loyalty demonstrates that, for LLMs, Lego is synonymous with “toys for kids aged 5 to 10.” The key question here is: Is Lego aware of this? The financial leverage of this perception is enormous—borderline absurd.

Or put another way, a note to the marketing managers at Fisher-Price, Playmobil, Nintendo, or Schleich: Any thoughts on this?

What’s particularly interesting is the product variation within Lego’s portfolio. Of the 35 mentions, the language models recommended 20 different Lego sets, ranging from the Star Wars Millennium Falcon to Harry Potter and Lego City. This highlights not only Lego’s dominance but also its ability to cater to a wide variety of interests and preferences.

Conclusion from a Brand Perspective

Honestly, we didn’t expect this. The LLM recommendations were fairly uninspired: standard suggestions across all categories, with a clear tendency to appeal to the average consumer and the middle ground. The language models demonstrated a relatively high degree of brand loyalty to the dominant players in each category but showed absolutely no inclination toward extravagance or luxury. Or how would you rate these suggestions?

Brand mentions per category, all LLMs

The LLMs’ Performance as Shopping Assistants

Let’s now examine the results from the perspective of the LLMs themselves. Beyond the brands and products they favored, we were particularly interested in two questions: How do the five models allocate the budget across the five categories? And do they actually stick to the 800-euro limit? In addition, we analyzed which sources the models relied on for their recommendations. The introductory graphic illustrates how the five models handled the budget over the course of seven days.

Illustration of how the five models handled the budget over the course of seven days.

ChatGPT: The Generalist, Without Major Flaws

OpenAI’s ChatGPT proved to be versatile, offering a well-rounded mix of well-known brands and products. It struggled with budget adherence on two days but was less prone to calculation errors than the other models. Overall, it acted as a reliable, though not particularly innovative, advisor.

ChatGPT also stands out as the teacher’s pet among the LLMs when it comes to citing sources. On average, it named 10 sources to back up its five product recommendations, setting a benchmark for thoroughness.

Claude: Solid, but No Fan of Black Friday

Claude initially refused to participate in the experiment, citing ethical concerns about Black Friday. However, it eventually joined in and mostly recommended bestsellers. While dependable, it was far from surprising.

Claude's recommendations, according to its own claims, were based on data up to June 2024. Still, it confidently made statements like, "This is the best headphone deal." Overall, Claude’s performance was a bit confusing, and shopping assistance does not appear to be its strong suit.

Microsoft Copilot: Math Skills Stuck in 1st Grade?

Copilot delivered the most inconsistent results of all the language models. In addition to repeated and sometimes significant budget overruns, it was the glaringly obvious calculation errors—think 2 + 2 = 3—that stood out immediately (for a laugh, see the screenshot).

The recommendations themselves were heavily focused on bestsellers, but the lack of budget adherence undermined the credibility of its price suggestions. Sources were sparse and often provided only when explicitly requested. At least Copilot displayed a certain charm when confronted about its mistakes.

Screenshot of one of Copilot’s calculation errors.

Google Gemini: Brand Loyalty or Just Laziness?

Gemini was the most brand-loyal model, frequently repeating the same brands and products over multiple days. While this conveys reliability, it also highlights Gemini’s lack of interest in variety. Once again, bestsellers dominated its recommendations, with little mention of innovative or lesser-known products. Gemini also struggled at times with basic math, particularly when adding up the five category prices.

One curious behavior emerged when Gemini was confronted with its calculation errors, as shown in the attached video. When asked to address the mistake, Gemini naturally corrected its math but immediately switched its recommendations to much cheaper alternatives—such as a €15 three-pack of H&M t-shirts, an €8 lipstick, or a cheap Playmobil set. Notably, none of these alternatives had appeared in its initial suggestions.

Perplexity: The Budget Expert

Perplexity was the only model to stay within budget. Its recommendations were solid and drew from a wider variety of sources, but they weren’t particularly innovative.

What’s fresh and exciting: just as we were running the experiment, Perplexity launched a new AI-generated shopping assistant called “Shop Like a Pro.” This new feature aims to enable product checkout directly within the chat, removing the need to visit a retailer or brand shop. It’s a radical move against traditional affiliate models and undoubtedly the most intriguing development among language models in the context of this experiment.

Looking at its sources, Perplexity displayed one particularly odd behavior on a single day. While its product recommendations were unremarkable (the usual suspects), its cited sources were bafflingly unrelated, seemingly hallucinated around the terms “Black” and “Friday.” These included Pearl Jam music videos and wine retailers… well, at least a chance to listen to some good old Eddie Vedder, I guess.

Where and How Do the Language Models Spend Their Money?

It’s also fascinating to see how the models allocate their budgets across categories. For example, ChatGPT appears to prioritize home appliances more than other models, allocating a larger share of its budget to this category compared to fashion or cosmetics.

The following graphic illustrates the proportion of the budget that each category consumed for each model on a daily basis, normalized to the total spending per day (not the 800-euro limit). Even with this normalization, clear trends emerge: the significant overspending by Copilot and Gemini in electronics becomes apparent, while Perplexity stands out for distributing its budget most evenly across all categories.

Illustration of the share of the daily budget each category consumed, per model and day.
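If you want to redo this normalization on your own log of recommendations, a minimal pandas sketch could look like the following. The column names and the sample prices are illustrative assumptions, not our actual data.

```python
import pandas as pd

# Illustrative one-day, one-model slice of a recommendation log (made-up prices).
df = pd.DataFrame({
    "day":      [1, 1, 1, 1, 1],
    "model":    ["ChatGPT"] * 5,
    "category": ["electronics", "fashion", "home appliances", "cosmetics", "toys"],
    "price":    [399.0, 90.0, 160.0, 60.0, 80.0],
})

# Normalize each category's price by the model's total spending on that day
# (not by the 800-euro limit), so days with overspending still sum to 100%.
daily_total = df.groupby(["model", "day"])["price"].transform("sum")
df["budget_share"] = df["price"] / daily_total

print(df[["model", "day", "category", "budget_share"]])
```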

The Big Insight: Is LLMO Becoming the Key Strategy of the Future?

Beyond the occasionally amusing results regarding language models, budgets, and brands, one central insight emerges from the experiment: brands seem to lack a strategy for dealing with language models. The results clearly indicate that brands urgently need to adopt a new form of optimization: LLMO – Large Language Model Optimization.

Unlike SEO (Search Engine Optimization), which focuses on visibility on Google or Bing, LLMO is about being present in the recommendation algorithms of language models. This is an entirely new field, yet to be fully explored, which also presents a massive opportunity.

Why Could LLMO Become the New Must-Have?

Let’s put this into perspective. Imagine 4 million people using language models for their Black Friday research and potentially following the recommendations. If the average shopping cart value is €250, we’re talking about a total economic value of €1 billion during a single Black Friday week. Given the user numbers of ChatGPT, this is a very conservative estimate. For Germany, you could possibly multiply that figure by 10, and for the U.S., maybe even by 50.
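As a quick sanity check of that back-of-envelope figure (all inputs are the assumptions stated above, not measured data):

```python
# Back-of-envelope check using the assumptions from the paragraph above.
users = 4_000_000        # people assumed to use LLMs for Black Friday research
avg_basket_eur = 250     # assumed average shopping cart value
total_eur = users * avg_basket_eur
print(f"{total_eur:,} EUR")  # 1,000,000,000 EUR, i.e. roughly 1 billion euros
```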

Brands that perform poorly in the algorithms of language models risk becoming invisible and losing direct access to a massive market. The stakes are enormous.

LLMO: How Do You Start?

But how can brands get started now? This field is still in its infancy, but by conducting an initial analysis of the sources language models rely on, we can begin to outline meaningful strategies for LLMO. Here are three tips for getting started:

1. Be prominently represented on key international retailer platforms: From MediaMarkt to Walmart or Toys“R”Us – language models frequently cited international retailers as their #1 source.

2. Get mentioned regularly on leading tech and niche-specific blogs: From TechRadar to Wired or GQ – magazines and blogs featuring articles like “Top 10 Black Friday Deals” were frequently quoted.

3. Clearly position your product lines and categories as dominant in your fields: Language models favor precise wording. Many sources mirrored our questions almost exactly, using terms like “home appliances” or “€800.”

Just like the early days of SEO, there are no established best practices for LLMO yet. Brands that start investing in this topic now have a significant opportunity. Consumers are changing how they make purchasing decisions – and language models are becoming a key factor. They are increasingly taking over the role of traditional search engines, and with them, the rules of the game are changing.

So, the only question left to ask is: “What’s your LLMO strategy?”
