Prompt Engineering 101

Published April 6, 2023


Wikipedia Definition: Prompt engineering is a concept in artificial intelligence (AI), particularly natural language processing (NLP). In prompt engineering, the description of the task that the AI is supposed to accomplish is embedded in the input, e.g., as a question, instead of it being implicitly given. Prompt engineering typically works by converting one or more tasks to a prompt-based dataset and training a language model with what has been called “prompt-based learning” or just “prompt learning”.

In this notebook I will use the OpenAI APIs and SerpAPI together with LangChain, a framework for developing applications powered by language models that lets us connect a language model to other sources of data.

The figure below, by DAIR.AI | Elvis Saravia, describes the elements of a prompt.

Elements of Prompt

This notebook is inspired by “Getting Started with Prompt Engineering” from DAIR.AI | Elvis Saravia. Please check Elvis's repository containing the video lecture and code here


Installing the needed packages.

%%capture
# update or install the necessary libraries
!pip install --upgrade openai
!pip install --upgrade langchain
!pip install --upgrade python-dotenv
!pip install --upgrade pypdf
!pip install --upgrade faiss-cpu
import openai
import os
import IPython
from langchain.llms import OpenAI
from dotenv import load_dotenv

Load environment variables.

Using python-dotenv, which makes use of a .env file containing the OPENAI_API_KEY and the SERPAPI_API_KEY.
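A minimal .env file just holds the two keys; the values below are placeholders:

OPENAI_API_KEY=sk-...
SERPAPI_API_KEY=your-serpapi-key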

load_dotenv()

# API configuration
openai.api_key = os.getenv("OPENAI_API_KEY")

# for LangChain
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["SERPAPI_API_KEY"] = os.getenv("SERPAPI_API_KEY")
def set_open_params(
    model="text-davinci-003",
    temperature=0.7,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
):
    """ set openai parameters"""

    openai_params = {}    

    openai_params['model'] = model
    openai_params['temperature'] = temperature
    openai_params['max_tokens'] = max_tokens
    openai_params['top_p'] = top_p
    openai_params['frequency_penalty'] = frequency_penalty
    openai_params['presence_penalty'] = presence_penalty
    return openai_params

def get_completion(params, prompt):
    """ GET completion from openai api"""

    response = openai.Completion.create(
        engine = params['model'],
        prompt = prompt,
        temperature = params['temperature'],
        max_tokens = params['max_tokens'],
        top_p = params['top_p'],
        frequency_penalty = params['frequency_penalty'],
        presence_penalty = params['presence_penalty'],
    )
    return response

Basic prompt example:

# basic example
params = set_open_params()

prompt = "The sky is"

response = get_completion(params, prompt)
response.choices[0].text
' blue\n\nThe sky is blue in color during the day and black at night when there is no sunlight.'
IPython.display.Markdown(response.choices[0].text)

blue

The sky is blue in color during the day and black at night when there is no sunlight.

Try with different temperature to compare results:

params = set_open_params(temperature=0)
prompt = "The sky is"
response = get_completion(params, prompt)
IPython.display.Markdown(response.choices[0].text)

blue

The sky is blue because of the way the atmosphere scatters sunlight. When sunlight passes through the atmosphere, the blue wavelengths are scattered more than the other colors, making the sky appear blue.
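To compare more systematically, we can loop over a few temperature values with the helpers defined earlier (the values chosen here are arbitrary):

# compare completions for the same prompt at a few temperatures
for temp in [0.0, 0.7, 1.0]:
    params = set_open_params(temperature=temp)
    response = get_completion(params, "The sky is")
    print(f"temperature={temp}: {response.choices[0].text.strip()}")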

PAL (Program-aided Language Model) - Code as Reasoning

This is a simple application that reasons about the question being asked by generating code.

Specifically, the application takes in some data and answers a question about that data.

The prompt includes a few exemplars, which are adapted from here.

# lm instance
llm = OpenAI(model_name='text-davinci-003', temperature=0)
#question = "Which is the youngest penguin?"
question = "What is the average age of the penguin?"
PENGUIN_PROMPT = '''
"""
Q: Here is a table where the first line is a header and each subsequent line is a penguin:
name, age, height (cm), weight (kg) 
Louis, 7, 50, 11
Bernard, 5, 80, 13
Vincent, 9, 60, 11
Gwen, 8, 70, 15
For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. 
We now add a penguin to the table:
James, 12, 90, 12
How many penguins are less than 8 years old?
"""
# Put the penguins into a list.
penguins = []
penguins.append(('Louis', 7, 50, 11))
penguins.append(('Bernard', 5, 80, 13))
penguins.append(('Vincent', 9, 60, 11))
penguins.append(('Gwen', 8, 70, 15))
# Add penguin James.
penguins.append(('James', 12, 90, 12))
# Find penguins under 8 years old.
penguins_under_8_years_old = [penguin for penguin in penguins if penguin[1] < 8]
# Count number of penguins under 8.
num_penguin_under_8 = len(penguins_under_8_years_old)
answer = num_penguin_under_8
"""
Q: Here is a table where the first line is a header and each subsequent line is a penguin:
name, age, height (cm), weight (kg) 
Louis, 7, 50, 11
Bernard, 5, 80, 13
Vincent, 9, 60, 11
Gwen, 8, 70, 15
For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.
Which is the youngest penguin?
"""
# Put the penguins into a list.
penguins = []
penguins.append(('Louis', 7, 50, 11))
penguins.append(('Bernard', 5, 80, 13))
penguins.append(('Vincent', 9, 60, 11))
penguins.append(('Gwen', 8, 70, 15))
# Sort the penguins by age.
penguins = sorted(penguins, key=lambda x: x[1])
# Get the youngest penguin's name.
youngest_penguin_name = penguins[0][0]
answer = youngest_penguin_name
"""
Q: Here is a table where the first line is a header and each subsequent line is a penguin:
name, age, height (cm), weight (kg) 
Louis, 7, 50, 11
Bernard, 5, 80, 13
Vincent, 9, 60, 11
Gwen, 8, 70, 15
For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.
What is the name of the second penguin sorted by alphabetic order?
"""
# Put the penguins into a list.
penguins = []
penguins.append(('Louis', 7, 50, 11))
penguins.append(('Bernard', 5, 80, 13))
penguins.append(('Vincent', 9, 60, 11))
penguins.append(('Gwen', 8, 70, 15))
# Sort penguins by alphabetic order.
penguins_alphabetic = sorted(penguins, key=lambda x: x[0])
# Get the second penguin sorted by alphabetic order.
second_penguin_name = penguins_alphabetic[1][0]
answer = second_penguin_name
"""
{question}
"""
'''.strip() + '\n'

Now that we have the prompt and the question, we can send them to the model. It should output the code steps needed to arrive at the answer.

llm_out = llm(PENGUIN_PROMPT.format(question=question))
print(llm_out)
# Put the penguins into a list.
penguins = []
penguins.append(('Louis', 7, 50, 11))
penguins.append(('Bernard', 5, 80, 13))
penguins.append(('Vincent', 9, 60, 11))
penguins.append(('Gwen', 8, 70, 15))
# Get the ages of the penguins.
ages = [penguin[1] for penguin in penguins]
# Calculate the average age.
average_age = sum(ages) / len(ages)
answer = average_age
exec(llm_out)
print(answer)
7.25
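One caveat: exec runs whatever code the model produced, with full access to the notebook's globals. For anything beyond a quick demo it is tidier to confine the generated code to its own namespace, as in the sketch below (note this is containment, not a real security sandbox):

# run the generated code in an isolated namespace instead of the
# notebook globals, then read back the `answer` variable it sets
namespace = {}
exec(llm_out, namespace)
print(namespace["answer"])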

Prompt Engineering using LangChain

Example adapted from the LangChain documentation.

from langchain.agents import load_tools
from langchain.agents import initialize_agent
!pip install google-search-results
llm = OpenAI(temperature=0)

tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
# run the agent
agent.run("Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?")
#agent.run("Who is Luis Inacio Lula da Silva's girlfriend? What is her current age raised to the 0.23 power?")


> Entering new AgentExecutor chain...
 I need to find out who Olivia Wilde's boyfriend is and then calculate his age raised to the 0.23 power.
Action: Search
Action Input: "Olivia Wilde boyfriend"
Observation: Olivia Wilde started dating Harry Styles after ending her years-long engagement to Jason Sudeikis — see their relationship timeline.
Thought: I need to find out Harry Styles' age.
Action: Search
Action Input: "Harry Styles age"
Observation: 29 years
Thought: I need to calculate 29 raised to the 0.23 power.
Action: Calculator
Action Input: 29^0.23
Observation: Answer: 2.169459462491557

Thought: I now know the final answer.
Final Answer: Harry Styles, Olivia Wilde's boyfriend, is 29 years old and his age raised to the 0.23 power is 2.169459462491557.

> Finished chain.
"Harry Styles, Olivia Wilde's boyfriend, is 29 years old and his age raised to the 0.23 power is 2.169459462491557."

Data-Augmented Generation

In this section we are going to use external data (a PDF of Amazon's 2021 Sustainability Report) as a source to augment the model's answers.

Code example adapted from the LangChain documentation. We are only using the examples for educational purposes.

Prepare the data first:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.cohere import CohereEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
from langchain.vectorstores import Chroma, FAISS
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("./example_data/2021-sustainability-report-amazon.pdf")
pages = loader.load_and_split()
print(pages[0])
page_content='Prime Air\nDelivering Progress\nEvery Day\nAmazon’s 2021 Sustainability Report' metadata={'source': './example_data/2021-sustainability-report-amazon.pdf', 'page': 0}
print(f'Number of pages extracted from the document: {len(pages)}')
Number of pages extracted from the document: 133
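load_and_split returns one document per page. If finer-grained chunks work better for retrieval, the CharacterTextSplitter imported above can re-split them; the chunk sizes here are illustrative:

# optionally re-split the pages into smaller overlapping chunks;
# chunk_size and chunk_overlap are illustrative values
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(pages)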

embeddings = OpenAIEmbeddings()
# index only the first 10 pages to keep this example fast and cheap
docsearch = FAISS.from_documents(pages[0:10], embeddings)
query = "When Amamzon will achieve net-zero ?"
docs = docsearch.similarity_search(query)
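Before handing the documents to a chain, it can be useful to inspect what the similarity search returned:

# peek at the retrieved chunks: page metadata plus the first
# few characters of each chunk
for doc in docs:
    print(doc.metadata, doc.page_content[:100])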

Let’s quickly test it:

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
#query = "What did the president say about Justice Breyer"
query= "When Amamzon will achieve net-zero?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
{'output_text': ' Amazon has committed to achieving net-zero carbon by 2040.\nSOURCES: 2021-sustainability-report-amazon.pdf'}

Let’s try a question with a custom prompt:

template = """Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.
Respond in Portuguese.

QUESTION: {question}
=========
{summaries}
=========
FINAL ANSWER IN PORTUGUESE:"""

# create a prompt template
PROMPT = PromptTemplate(template=template, input_variables=["summaries", "question"])

# query 
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
query = "When Amamzon will achieve net-zero?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
{'output_text': '\nA Amazon comprometeu-se a alcançar o carbono líquido zero até 2040, 10 anos antes do Acordo de Paris. Como parte dos esforços para descarbonizar a sua empresa, a Amazon tornou-se o maior comprador corporativo de energia renovável do mundo em 2020 e, no ano passado, atingiu 85% de energia renovável em todos os seus negócios. Estamos comprometidos a atingir o carbono líquido zero em todas as nossas operações até 2040.\n\nFONTE: ./example_data/2021-sustainability-report-amazon.pdf'}

In English: Amazon has committed to reaching net-zero carbon by 2040, 10 years ahead of the Paris Agreement. As part of its decarbonization efforts, Amazon became the world's largest corporate buyer of renewable energy in 2020 and, last year, reached 85% renewable energy across all of its businesses.
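As a final note, re-embedding the PDF on every run is slow and costs API calls. The FAISS index can be persisted and reloaded; save_local and load_local are the LangChain FAISS methods for this, and the folder name below is illustrative:

# persist the FAISS index to disk and reload it later, so the PDF
# does not have to be re-embedded on every run
docsearch.save_local("faiss_index")
docsearch = FAISS.load_local("faiss_index", embeddings)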