Google Whitepaper: The Art and Science of Prompt Engineering

Google released a comprehensive whitepaper on prompt engineering authored by Lee Boonstra (September 2024). This post distills the key insights from that 65-page document - a practical guide to getting better results from Large Language Models.

1. Introduction

When thinking about LLM input and output, a text prompt (sometimes accompanied by other modalities such as images) is the input the model uses to predict a specific output.

You don’t need to be a data scientist or a machine learning engineer - everyone can write a prompt.

However, crafting the most effective prompt can be complicated. Many aspects affect its efficacy: the model you use, the model’s training data, model configurations, word choice, style, tone, structure, and context.

Prompt engineering is therefore an iterative process. Inadequate prompts can lead to ambiguous, inaccurate responses and hinder the model’s ability to provide meaningful output.

2. What is Prompt Engineering?

Remember how an LLM works: it’s a prediction engine. The model takes sequential text as input and predicts what the following token should be, based on the data it was trained on. The LLM does this repeatedly, adding each predicted token to the sequence for predicting the next one.

When you write a prompt, you’re attempting to set up the LLM to predict the right sequence of tokens.

Prompt engineering is the process of designing high-quality prompts that guide LLMs to produce accurate outputs. This involves tinkering to find the best prompt, optimizing prompt length, and evaluating writing style and structure in relation to the task.

Prompts can be used for various tasks: text summarization, information extraction, question and answering, text classification, language or code translation, code generation, and code documentation and reasoning.

3. LLM Output Configuration

Once you choose your model, you need to figure out the model configuration. Most LLMs come with various configuration options that control output. Effective prompt engineering requires setting these optimally for your task.

3.1 Output Length

An important setting is the number of tokens to generate. Generating more tokens requires more computation, leading to higher energy consumption, potentially slower response times, and higher costs.

Important: Reducing output length doesn’t make the LLM more stylistically succinct - it just causes the LLM to stop predicting once the limit is reached. If you need short output, you’ll also need to engineer your prompt to accommodate.

3.2 Sampling Controls

LLMs don’t formally predict a single token. Rather, they predict probabilities for what the next token could be. These probabilities are then sampled to determine the output token.

Temperature, top-K, and top-P are the most common settings that determine how predicted token probabilities are processed.

3.2.1 Temperature

Temperature controls the degree of randomness in token selection:

Temperature 0 (greedy decoding): Deterministic - highest probability token always selected
Low temperature (0.1-0.3): More deterministic, factual responses
High temperature (0.7-1.0): More diverse, creative, unexpected results
Very high (>1.0): All tokens become equally likely

3.2.2 Top-K and Top-P

These sampling settings restrict the predicted next token to come from tokens with top predicted probabilities.

Top-K sampling selects the top K most likely tokens. Higher top-K means more creative and varied output; lower top-K means more restrictive and factual output. Top-K of 1 is equivalent to greedy decoding.

Top-P sampling (nucleus sampling) selects top tokens whose cumulative probability doesn’t exceed P. P = 0 means only the most probable token (greedy decoding); P = 1 means all tokens in vocabulary are considered.

3.2.3 Putting It All Together

When temperature, top-K, and top-P are all available, tokens meeting both top-K and top-P criteria become candidates, then temperature is applied to sample from those candidates.

Extreme settings cancel out other configurations: Temperature = 0 makes top-K and top-P irrelevant; Top-K = 1 makes temperature and top-P irrelevant; Top-P = 0 makes temperature and top-K irrelevant.

Recommended starting points:

Coherent, moderately creative: Temperature 0.2, Top-P 0.95, Top-K 30
Especially creative: Temperature 0.9, Top-P 0.99, Top-K 40
Less creative, factual: Temperature 0.1, Top-P 0.9, Top-K 20
Single correct answer (math): Temperature 0

4. Prompting Techniques

LLMs are tuned to follow instructions and trained on large amounts of data. But they aren’t perfect - the clearer your prompt, the better the LLM can predict the right text.

4.1 General Prompting / Zero-Shot

A zero-shot prompt is the simplest type. It only provides a description of the task and some text for the LLM to get started with. The input could be a question, start of a story, or instructions. “Zero-shot” means “no examples.”

For example, you could ask the model to classify a movie review as POSITIVE, NEUTRAL, or NEGATIVE without showing it any examples first. The model uses its training to understand what you want.

Pro tip: Use low temperature (0.1) since no creativity is needed for classification tasks.

4.2 One-Shot & Few-Shot Prompting

When zero-shot doesn’t work, provide examples in the prompt.

One-shot prompt: Provides a single example for the model to imitate.

Few-shot prompt: Provides multiple examples showing a pattern to follow.

How many examples? It depends on complexity of the task, quality of the examples, and capabilities of the model. General rule: use at least 3-5 examples for few-shot prompting. More complex tasks may need more; input length limitations may require fewer.

Guidelines for choosing examples:

Use examples relevant to your task
Examples should be diverse and high quality
One small mistake can confuse the model
Include edge cases for robust output

4.3 System, Contextual, and Role Prompting

These three techniques guide how LLMs generate text, focusing on different aspects:

System Prompting sets overall context and purpose - the “big picture” of what the model should do. System prompts define the model’s fundamental capabilities and can specify output requirements like format or safety guidelines.

Role Prompting assigns a specific character or identity to the model, affecting output style, voice, and personality. For example, asking the model to “act as a travel guide” helps it generate responses consistent with that role’s knowledge and behavior. You can also specify styles: confrontational, descriptive, formal, humorous, inspirational, persuasive, etc.

Contextual Prompting provides specific details or background information relevant to the current task, helping the model understand nuances and tailor responses accordingly. This ensures the model quickly understands your request and generates more accurate, relevant responses.

There can be considerable overlap between these three types.

Benefits of structured output (like JSON):

No manual format creation needed
Data can be returned in sorted order
Forces structure and limits hallucinations

4.4 Step-Back Prompting

Step-back prompting improves performance by prompting the LLM to first consider a general question related to the specific task, then feeding that answer into a subsequent prompt.

This “step back” activates relevant background knowledge and reasoning before attempting to solve the specific problem.

Benefits:

More accurate and insightful responses
Encourages critical thinking
Applies knowledge in new, creative ways
Can mitigate biases by focusing on general principles

For example, before asking “Write a storyline for a new video game level,” first ask “What are 5 key settings that contribute to challenging and engaging game levels?” Then use that response as context for your specific request.

4.5 Chain of Thought (CoT)

Chain of Thought prompting improves reasoning by generating intermediate reasoning steps. This helps the LLM generate more accurate answers, especially for mathematical and logical problems.

LLMs often struggle with mathematical tasks because they’re trained on text, and math requires a different approach. The simple addition of “Let’s think step by step” to your prompt can dramatically improve accuracy.

Advantages of CoT:

Low-effort but very effective
Works with off-the-shelf LLMs (no fine-tuning needed)
Provides interpretability - you can see the reasoning steps
Improves robustness between different LLM versions

Disadvantages:

More output tokens = higher cost and longer response times

Few-shot CoT is even more powerful - provide examples that include the reasoning steps, not just the final answer.

Good candidates for CoT:

Code generation (breaking requests into steps)
Synthetic data creation
Any task that can be solved by “talking through it”

Best practices for CoT:

Put the answer after the reasoning
Extract the final answer separately from the reasoning
Set temperature to 0 (CoT is based on greedy decoding)

4.6 Self-Consistency

While LLMs have shown impressive success, their reasoning ability is often limited. CoT helps but uses simple “greedy decoding.” Self-consistency combines sampling and majority voting for better results.

How it works:

Generate diverse reasoning paths: Provide the same prompt multiple times with high temperature to encourage different approaches
Extract the answer from each generated response
Choose the most common answer (majority voting)

Self-consistency gives a pseudo-probability likelihood of an answer being correct, but obviously has higher costs due to multiple runs.

4.7 Tree of Thoughts (ToT)

Tree of Thoughts generalizes CoT by allowing LLMs to explore multiple different reasoning paths simultaneously, rather than following a single linear chain.

How it works:

Maintains a “tree of thoughts”
Each thought is a coherent language sequence serving as an intermediate step
The model explores different reasoning paths by branching from different nodes

While CoT follows a single linear path from input to output, ToT branches out into multiple parallel paths before converging on the final output.

ToT is particularly well-suited for complex tasks requiring exploration.

4.8 ReAct (Reason + Act)

ReAct is a paradigm for enabling LLMs to solve complex tasks using natural language reasoning combined with external tools (search, code interpreter, APIs). This allows LLMs to perform actions like interacting with external APIs to retrieve information.

ReAct is a first step towards agent modeling.

ReAct mimics how humans operate: we reason verbally and take actions to gain information.

How it works:

LLM reasons about the problem and generates a plan
Performs actions in the plan
Observes the results
Updates reasoning and generates a new plan
Continues until reaching a solution

For example, when asked “How many kids do the band members of Metallica have?”, a ReAct agent would search for who the band members are, then search for each member’s children, then aggregate the results.

4.9 Automatic Prompt Engineering (APE)

Writing prompts can be complex. Automatic Prompt Engineering automates this by using a model to write prompts for you.

Process:

Generate: Prompt a model to generate output variants
Evaluate: Score candidates using metrics (BLEU, ROUGE)
Select: Choose the highest-scoring candidate
Iterate: Tweak and evaluate again

This is particularly useful for generating training data variations or finding optimal prompt formulations.

5. Code Prompting

Gemini and other LLMs can help with writing code in any programming language. This can significantly speed up development.

5.1 Prompts for Writing Code

You can ask LLMs to write code by describing what you need. Be specific about the programming language and the desired functionality.

Important: Since LLMs can’t truly reason and may repeat training data patterns, always read and test generated code before using it.

5.2 Prompts for Explaining Code

When working in teams, you often need to read others’ code. LLMs can help explain unfamiliar code by breaking down each section and describing what it does.

5.3 Prompts for Translating Code

Need to convert code between languages? LLMs can translate code from one programming language to another while maintaining the same functionality.

5.4 Prompts for Debugging and Reviewing Code

LLMs can help debug code by identifying bugs from error messages and suggesting fixes. They can also recommend improvements like better error handling, cleaner syntax, or edge case handling.

5.5 What About Multimodal Prompting?

Prompting for code uses regular language models. Multimodal prompting is a separate concern - it refers to using multiple input formats (text, images, audio, code) depending on the model’s capabilities and the task.

6. Best Practices

Finding the right prompt requires tinkering. Use these best practices to become a pro.

6.1 Provide Examples

The most important best practice. Few-shot prompting is highly effective because examples act as a powerful teaching tool. They showcase desired outputs, allowing the model to learn and tailor its generation accordingly. It’s like giving the model a reference point or target to aim for.

6.2 Design with Simplicity

Prompts should be concise, clear, and easy to understand for both you and the model.

Rule of thumb: If it’s confusing for you, it will likely be confusing for the model.

Use action verbs: Act, Analyze, Categorize, Classify, Compare, Create, Describe, Define, Evaluate, Extract, Find, Generate, Identify, List, Organize, Parse, Predict, Recommend, Retrieve, Rewrite, Select, Summarize, Translate, Write

6.3 Be Specific About the Output

A concise instruction might not guide the LLM enough or could be too generic. Specific details help the model focus on what’s relevant. Instead of “Generate a blog post about video game consoles,” say “Generate a 3 paragraph blog post about the top 5 video game consoles. The blog post should be informative and engaging, written in a conversational style.”

6.4 Use Instructions Over Constraints

Instructions tell the model what TO do. Constraints tell the model what NOT to do.

Research suggests positive instructions are more effective than constraints. This aligns with how humans prefer positive instructions over lists of what not to do.

Why instructions work better:

Directly communicate desired outcome
Give flexibility and encourage creativity
Constraints might leave the model guessing
Lists of constraints can clash with each other

When to use constraints: Preventing harmful or biased content, or when strict output format is required.

6.5 Control the Max Token Length

To control response length, set a max token limit in configuration, or explicitly request specific length in your prompt (e.g., “Explain quantum physics in a tweet-length message”).

6.6 Use Variables in Prompts

Make prompts dynamic and reusable by using variables that can be changed for different inputs. This saves time, makes integration easier, and lets you use the same prompt template with different inputs.

6.7 Experiment with Input Formats and Writing Styles

Different models, configurations, formats, and word choices yield different results. Experiment with style, word choice, and prompt type (zero-shot, few-shot, system prompt).

The same goal can be formulated as a question, a statement, or an instruction - try different approaches.

6.8 For Few-Shot Classification, Mix Up the Classes

When doing classification tasks, mix up the possible response classes in your examples. Otherwise, you might overfit to the specific order.

By mixing classes, you ensure the model learns to identify key features of each class, rather than memorizing example order.

Good rule of thumb: Start with 6 few-shot examples and test accuracy from there.

6.9 Adapt to Model Updates

Stay on top of model architecture changes, added data, and capabilities. Try newer model versions and adjust prompts to leverage new features.

6.10 Experiment with Output Formats

For non-creative tasks (extracting, selecting, parsing, ordering, ranking, categorizing), try structured output formats like JSON or XML. This forces structure and limits hallucinations.

6.11 Experiment Together with Other Prompt Engineers

If you need a good prompt, have multiple people make attempts. When everyone follows best practices, you’ll see variance in performance between different approaches. Collaboration leads to better prompts.

6.12 Document the Various Prompt Attempts

Document your prompt attempts in full detail so you can learn what works and what doesn’t.

Prompt outputs can differ across models, across sampling settings, across different versions of the same model, and even across identical prompts with small formatting or word choice differences.

Track these fields for each attempt:

Name and version of prompt
Goal (one sentence)
Model name and version
Configuration (Temperature, Token Limit, Top-K, Top-P)
Full prompt text
Output(s)
Result (OK / NOT OK / SOMETIMES OK)
Feedback notes

Additional tips:

Track iteration versions
Save prompts in separate files from code for easier maintenance
Rely on automated tests and evaluation procedures

7. Summary

This whitepaper covers a comprehensive set of prompting techniques:

Zero-shot and Few-shot prompting
System, Role, and Contextual prompting
Step-back prompting
Chain of Thought (CoT)
Self-consistency
Tree of Thoughts (ToT)
ReAct (Reason + Act)
Automatic Prompt Engineering

Key Principles to Remember:

Prompt engineering is iterative - craft, test, analyze, document, refine, repeat
Examples are powerful - few-shot often beats clever zero-shot
Chain of Thought unlocks reasoning - “Let’s think step by step”
Configuration matters - temperature, top-K, top-P affect output significantly
Document everything - you’ll need it when revisiting prompts
Test thoroughly - especially for code, always verify outputs
When you change a model or configuration, go back and re-experiment

This post summarizes key insights from Google’s “Prompt Engineering” whitepaper by Lee Boonstra (September 2024).

1. Introduction#

2. What is Prompt Engineering?#

3. LLM Output Configuration#

3.1 Output Length#

3.2 Sampling Controls#

3.2.1 Temperature#

3.2.2 Top-K and Top-P#

3.2.3 Putting It All Together#

4. Prompting Techniques#

4.1 General Prompting / Zero-Shot#

4.2 One-Shot & Few-Shot Prompting#

4.3 System, Contextual, and Role Prompting#

4.4 Step-Back Prompting#

4.5 Chain of Thought (CoT)#

4.6 Self-Consistency#

4.7 Tree of Thoughts (ToT)#

4.8 ReAct (Reason + Act)#

4.9 Automatic Prompt Engineering (APE)#

5. Code Prompting#

5.1 Prompts for Writing Code#

5.2 Prompts for Explaining Code#

5.3 Prompts for Translating Code#

5.4 Prompts for Debugging and Reviewing Code#

5.5 What About Multimodal Prompting?#

6. Best Practices#

6.1 Provide Examples#

6.2 Design with Simplicity#

6.3 Be Specific About the Output#

6.4 Use Instructions Over Constraints#

6.5 Control the Max Token Length#

6.6 Use Variables in Prompts#

6.7 Experiment with Input Formats and Writing Styles#

6.8 For Few-Shot Classification, Mix Up the Classes#

6.9 Adapt to Model Updates#

6.10 Experiment with Output Formats#

6.11 Experiment Together with Other Prompt Engineers#

6.12 Document the Various Prompt Attempts#

7. Summary#

1. Introduction

2. What is Prompt Engineering?

3. LLM Output Configuration

3.1 Output Length

3.2 Sampling Controls

3.2.1 Temperature

3.2.2 Top-K and Top-P

3.2.3 Putting It All Together

4. Prompting Techniques

4.1 General Prompting / Zero-Shot

4.2 One-Shot & Few-Shot Prompting

4.3 System, Contextual, and Role Prompting

4.4 Step-Back Prompting

4.5 Chain of Thought (CoT)

4.6 Self-Consistency

4.7 Tree of Thoughts (ToT)

4.8 ReAct (Reason + Act)

4.9 Automatic Prompt Engineering (APE)

5. Code Prompting

5.1 Prompts for Writing Code

5.2 Prompts for Explaining Code

5.3 Prompts for Translating Code

5.4 Prompts for Debugging and Reviewing Code

5.5 What About Multimodal Prompting?

6. Best Practices

6.1 Provide Examples

6.2 Design with Simplicity

6.3 Be Specific About the Output

6.4 Use Instructions Over Constraints

6.5 Control the Max Token Length

6.6 Use Variables in Prompts

6.7 Experiment with Input Formats and Writing Styles

6.8 For Few-Shot Classification, Mix Up the Classes

6.9 Adapt to Model Updates

6.10 Experiment with Output Formats

6.11 Experiment Together with Other Prompt Engineers

6.12 Document the Various Prompt Attempts

7. Summary