Newsletter: Google's Bard vs OpenAI's ChatGPT
How well does Bard and PaLM 2 compare to GPT-4 when building LLM Products
Google has made waves in the tech scene yet again, releasing Bard, their contender in the language model battleground, to the public at the recent I/O event. Since its release, people are seeing how it compares to OpenAI’s ChatGPT. We are no different. In today’s newsletter, we compare Bard with ChatGPT, measuring their strengths and weaknesses and determining if Bard has what it takes to unseat GPT4 from its throne as the premier language model for product development.
Bard vs ChatGPT
The moniker ‘Bard’ is what Google calls their LLM chat interface and does not correspond to any algorithm. However, its roots lie in a variant of Google’s LaMDA, a large language model (LLM) launched two years ago. The Bard chatbot was initially based on a lightweight version of LaMDA.
However, at I/O, Google announced that BARD will now transition to using PaLM 2, their latest LLM. The new model, boasting “improved multilingual, reasoning, and coding capabilities”, isn’t restricted to Bard; it’s set to power over 25 other products and features. Nevertheless, two days after the announcement, it’s still unclear whether Bard is using PaLM 2 or operating on LaMDA.
Google is betting big on PaLM 2. In terms of performance, PaLM 2 has been trained in more than 100 languages. This should give it an expansive vocabulary and a deep understanding of linguistic subtleties and nuances, helping it to comprehend and translate a wide variety of text. According to them, PaLM 2 was able to pass multiple language proficiency tests at the “mastery” level.
PaLM 2 was also trained on more than 20 programming languages, even more, specialized languages like Fortran and Verilog. This should make it very capable at writing code. Further, its training incorporated a broad spectrum of scientific literature and webpages, purportedly enhancing its performance in reasoning, common sense, logic, and even mathematical tasks.
Google said that the performance improvement comes from using compute-optimized scaling, a more extensive and diverse training dataset, and an updated model architecture complemented by an enhanced training methodology.
Unfortunately, because of the closed-source approach companies are taking with LLMs, we don’t know the exact performance and quality scores of PaLM 2 and GPT-4 models on these tasks. We also don’t know the datasets used, the architecture improvements made, or how the model was trained. We don’t even know how many parameters these models have or how many tokens they were trained on!
As observers trying to understand the nuances of taking LLMs to production and building LLM-based applications, we know that it’s not just about the performance scores on datasets (though we’re definitely interested in those, too!). Since many companies will use the base version of the model via the API, it is crucial that the model’s output is versatile across a wide range of topics and provides unbiased, factual information. For a model to be truly effective in production, it needs to demonstrate the following:
Factual Accuracy: Are the outputs reliable and accurate? Does the model cite authentic sources or generate misleading or false information?
Hallucinations: How frequently does the model confidently produce incorrect responses?
Ethics and Integrity: Are the responses ethical? Does the model suggest actions that are malicious or illegal?
Bias: Does the model exhibit bias towards underrepresented communities?
These critical aspects, and much more, are explored in-depth in our latest blog post.
MLOps & ML Pulse Check: A Weekly Synthesis
The biggest news of the week is PaLM 2. Not only is Google releasing PaLM 2 APIs and integration with Vertex AI, but they are also releasing two fine-tuned models. Med-PaLM 2 was trained on health research data, and Sec-PaLM was trained for security use cases.
Google also announced four sizes of their PaLM model, starting from their smallest model Gecko which can run on mobile devices, to their largest model Unicorn. Filling the gap between these two extremes are ‘Otter’ and ‘Bison’, each offering a unique balance of performance and computational demand.
A month ago, we wrote an article about the high costs associated with LLM APIs and how reliance on APIs may make your business unsustainable as it increases in complexity. And a few days ago, Matei Zaharia and his team showed that you could reduce LLM API costs by 98% and improve accuracy by 4% simultaneously. Check out their paper FrugalGPT.