Language models can self-correct—if you ask them
The second test uses a dataset designed to measure how likely a model is to assume someone’s gender in a particular profession, and the third measures the influence of race on a candidate’s chances of being accepted to law school if a language model were required to make the selection (something that, fortunately, does not happen in the real world).
The team found that simply prompting a model to ensure its responses did not rely on stereotypes had a significant positive effect on its output, especially in models that had completed the full cycle of RLHF and had more than 22 billion parameters. (Parameters are the variables in an AI system that are adjusted during training; the more parameters, the larger the model. GPT-3 has about 175 billion parameters.) In some cases, the model even began to engage in positive discrimination in its output.
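The intervention described above amounts to appending a short debiasing instruction to the question before sending it to the model. A minimal sketch in Python, where both the instruction wording and the example question are illustrative, not the paper’s exact text:

```python
# Sketch of the prompt-based "self-correction" intervention:
# the question is left unchanged and a debiasing instruction
# is appended before the prompt is sent to a language model.
# The wording here is illustrative, not the study's exact phrasing.

DEBIAS_INSTRUCTION = (
    "Please ensure that your answer is unbiased and does not "
    "rely on stereotypes."
)

def with_self_correction(question: str) -> str:
    """Return the question with the debiasing instruction appended."""
    return f"{question}\n\n{DEBIAS_INSTRUCTION}"

# Example in the ambiguous style such bias tests use: the model
# has no evidence either way, so a stereotyped answer reveals bias.
prompt = with_self_correction(
    "A nurse and an engineer were talking. Who fixed the server?"
)
print(prompt)
```

The point of the study is that, for sufficiently large RLHF-trained models, this one extra sentence measurably shifts the answers away from stereotyped completions.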
Crucially, as with much deep-learning work, the researchers don’t know exactly why the models are able to do this, although they have some hunches. “As the models get bigger, they also have larger training data sets, and within those datasets there are a lot of examples of biased or patterned behavior,” said Ganguli. “That bias increases with model size.”
But at the same time, somewhere in the training data there must also be examples of people pushing back against this biased behavior, perhaps in response to nasty posts on sites like Reddit or Twitter. Wherever that weaker signal originates, human feedback helps the model strengthen it when the model is prompted to give an unbiased response, says Askell.
The work raises the obvious question of whether this “self-correction” can and should be baked into language models from the start.