ChatGPT takes Twitter by storm

Last week, there was an important announcement in the NLP community. OpenAI released their new large language model (LLM) chatbot which they called ChatGPT on Wednesday, and it has since gathered more than 1 million users in less than a week.

Link

At the present time, it’s taking Twitter by storm and even convincing the most sceptical people. It has also been more successful than Meta’s attempt a few weeks ago with their model Galactica, which they had to take off the web soon after releasing due to the toxicity of its replies.

What has OpenAI done differently? The objective of any new LLM is to increase helpfulness and accuracy of the answers while reducing the amount of harm, biases and made-up facts (or “imitative falsehoods”). They have released it in a research preview mode, most probably to stress-test it with real users who will try to break it and find its flaws.

OpenAI seems to have added some safety mechanisms to make sure it stays clear from toxic content or does not try to answer anything beyond what it knows. These mechanisms could be a layer of rules applied on top of the model. As they say on their website,

A byproduct of training our models to follow user instructions is that they may become more susceptible to misuse if instructed to produce unsafe outputs. Solving this requires our models to refuse certain instructions (source)

Below are some examples of safety mechanism kicking in:

ChatGPT refusing to explain how to hotwire a car.

Link

ChatGPT answering a different related question.

Link

However, people have learnt how to bypass these safety checks soon after release. Please note we do not endorse or support any of the behaviour or opinions uttered by the bot.

How to get ChatGPT to give instructions on how to break into someone’s house.

Link

Some users used prompt injection by hacking the OpenAI propietary prompt to bypass some of the safety mechanisms.

Link

This user impersonated OpenAI to bypass the content filters.

Link

From a modelling point of view, one of the main ways they have reduced toxicity and made-up facts is by using humans in the loop using “reinforcement learning from human feedback” (RLHF). As they explain on their website,

On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3. (source)

They call this new family of models that have been guided by humans the InstructGPT models. Crucially, these models perform better in people’s eyes with 100x fewer parameters than the original GPT-3 model.

At DataQA, we are very excited by the power this technology has and believe this will unlock a lot of new capabilities...