
How to Talk to Large Language Models

CyVerse Full Prompt Engineering Workshop


How is the AI revolution impacting Open Science?

Center for Open Science: Evaluating AI’s Impact on Open Research Infrastructure

"When used responsibly, it can support open models and data, accelerate discovery, and aid in the evaluation of research...
... but it can also undermine credibility when used to plagiarize, fabricate findings, or mislead readers. "

"With the emergence of generative AI, it has become very easy to create content that looks like a real research paper, but is not genuine research."

MDPI Blog: How Artificial Intelligence is Accelerating Open Access Science

"Artificial intelligence, like any new technology, presents both a threat and an opportunity. It requires reflection, adjustment, and adaptation.

If implemented carefully and thoughtfully, AI could help us respond to some of the issues that the Open Access scientific publishing industry currently faces. These include the increasing amounts of data being produced and also language barriers and imbalances in outputs between countries.

Further, AI could help to promote openness in datasets and content aggregators.
"

"GPT is not fully reliable.

GPT training involves analysing a huge body of text and noticing patterns so it can predict the next word in a passage.

This results in human-like text, meaning it sounds like it’s written by a human but may not necessarily be by one. Similarly, it may sound like it is conveying meaning, but the argument or claim being made may be without evidence or structure.
"

"Artificial intelligence is changing Open Access; it’s changing everything. Ultimately, though, it’s a tool, so how it’s used determines its value.

If used carefully, AI could help advance Open Access by automating repetitive data-related tasks, making the translation process more interactive, and promoting openness in datasets and content aggregators.


However, attention must be paid to its flaws and potential for misuse."


Rubber duck debugging, Wikipedia

The AI revolution is here, and it isn't going away any time soon. Tools such as ChatGPT, machine learning, and Large Language Models (LLMs) present an opportunity that is probably as impactful for the average person as the arrival of the internet. Scientists have spent decades encouraging the adoption of Open Science practices, but with AI, Open Science must revisit many of its pillars and values.

For example, the need to ask another person to review your work is quickly being overtaken by using LLMs to help improve and edit it, challenging our notions of collaboration and peer review. When writing code, LLMs are a fantastic resource that can help remove typos, encourage conciseness, and create helpful comments.

However, LLMs can act like echo chambers, where your expectations can lead to hallucinations or even pave the way to the "new p-hacking": prompt-hacking.

Therefore, it is imperative that, as scientists, we embrace the discussion of AI in Open Science, understand how it can help us in our daily work, and challenge ourselves to ensure that science stays human.


LLM Chatbots for Open Science

Large Language Model (LLM) chatbots have fundamentally changed how we interact with computers. They provide a natural language interface for instructing computers to do many tasks, including:

  • Read, write, and summarize text
  • Analyze data
  • Explain technical topics
  • Search the web and retrieve information
  • Generate, optimize, and explain many types of computer code
  • Understand and generate images


Current LLMs generally provide recommendations for how you could do things, i.e., they provide code and text suggestions but don't actually execute anything. These technologies are advancing quickly, though, and new capabilities are developed and released constantly. Soon, AI agents could be everywhere, carrying out instructions in autonomous and semi-autonomous ways.


Commercial Chatbots

[Logos: OpenAI, Gemini, FOSTER, COS]

LLMs in 150 words (or less)

How they're made: LLMs work by training on vast amounts of text from the internet. They learn patterns, grammar, and context from this data. When you give them a prompt, they generate text based on what they've learned. Imagine a super-smart autocomplete for text, but it can also create entire paragraphs or articles.

How they work: LLMs don't understand like humans do. They predict what comes next in a sentence using math and probabilities. They don't have thoughts or feelings. They mimic human language but can make mistakes or write nonsense if not guided well.

How you can use them: They're incredibly versatile. You can use them for answering questions, writing essays, coding help, and more. But you must be cautious because they can generate biased or false information if not used responsibly.

In a nutshell, LLMs are like super-powered text generators trained on the internet's vast knowledge.
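
To make the "super-powered autocomplete" idea concrete, here is a toy Python sketch of next-word prediction. The hand-written probability table below is an assumption standing in for what a real model learns from training data; real LLMs score tens of thousands of subword tokens at every step.

```python
import random

# Hand-made stand-in for a learned model: for the current word,
# the probability of each candidate next word.
next_word_probs = {
    "open": {"science": 0.6, "access": 0.3, "source": 0.1},
    "science": {"is": 0.5, "requires": 0.3, "stays": 0.2},
}

def predict_next(word: str) -> str:
    # Sample a next word in proportion to its probability,
    # just as an LLM samples its next token.
    candidates = next_word_probs[word]
    return random.choices(list(candidates), weights=candidates.values())[0]

text = ["open"]
for _ in range(2):
    text.append(predict_next(text[-1]))
print(" ".join(text))  # e.g., "open science is"
```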



⚠️⚠️ VERIFY EVERYTHING CHATBOTS TELL YOU! ⚠️⚠️



Prompt Writing

LLM Chatbots are meant to be conversational. In general, you ask the Chatbot questions or give it instructions (known as prompts), and the Chatbot responds with answers.

It is a bit of an art form to get the Chatbot to provide answers with the specificity and format that you want. An entire field of study, called Prompt Engineering, has sprung up to find the magic words that will elicit the best (and technically correct) responses from the Chatbot.



Prompt Priming

Provide lots of organized detail to help the Chatbot understand the question and what its task is. This could include adding a backstory or context explaining why you are asking the question. Be very specific about what you want from the Chatbot and how you want it presented.

Zero-shot unconditioned prompts are likely to return the least specific responses. Responses are more likely to be useful when multiple specific output types are defined.

| Types of Priming | Example |
|---|---|
| Zero (Shot) | "Write five examples of assessments for watershed health." |
| Single | "Write five examples of assessments for watershed health. Here is one example: Geomorphology" |
| Multiple | "Write five examples of assessments for watershed health related to geomorphology, water quality, and species diversity." |



Linked Prompts

Responses to prompts may not return the exact details or information that you are after the first time. Follow up by rephrasing your prompts more carefully; continuing with iterative prompting builds upon your priors.

"Chain prompting" or "Linked Prompting" brings multiple prompts together.

Linked Prompting Examples
Step 1: Priming "I want you to act as an eminent hydrologist from CUAHSI. Provide me with a list of the ten most important topics in hydrology over the last decade, focused on research in the Global South, working with indigenous communities, and traditional ecological knowledge systems."
Step 2: Summarizing "Based on the list you just created, summarize the most pressing financial challenges faced by indigenous communities in the Global South, versus indigenous communities in North America, in less than 50 words."
Step 3: Try again with a web search "Based on the results of a web search, can you confirm the validity of the ten important topics and provide at least one reference for each?"

Encouraging the Chatbot to do Better

Chatbot responses can be missing information or just plain wrong. When this occurs, you can point out the mistake and ask the Chatbot to provide a more complete or better answer. Don't settle for poor responses!



Role Playing

Some people find that asking the Chatbot to adopt a persona will lead to better responses.

"I want you to act as ..." will establish what type of conversation you are planning to have.

Types of Roles
Project Manager
Copywriter / Editor
Paper Reviewer
Teacher / Mentor / Advisor
Student / Learner / Participant
Software Engineer
DevOps Engineer
Linux Terminal
Python Interpreter
Web Browser




Prompting Chatbots for FOSS


Provide a general outline for a data management plan

I am writing a grant proposal to the National Science Foundation. 
Could you please provide me with a basic template for a data management plan (DMP), 
and please provide URL links to resources that can help me with NSF DMP requirements.


Provide a step-by-step recipe to create and serve an MkDocs website on GitHub

I would like to create a personal website using MkDocs 
and host it on GitHub Pages.

Could you please write me a step-by-step guide, starting 
with importing an existing GitHub repository that uses MkDocs Material.


Write shell commands and shell scripts

I would like to create a Linux shell script to automate the backup of my working directory. 
Could you please suggest a shell script that will copy my working directory 
to a different directory and compress it into an archive. 
Please name the archive based on the current date and time. 
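
For comparison, here is a minimal Python sketch of the behavior this prompt asks for (the chatbot would normally answer with a shell script; the directory paths are assumptions):

```python
import shutil
from datetime import datetime
from pathlib import Path

# Assumed locations; point these at your own directories.
work_dir = Path.home() / "work"
backup_dir = Path.home() / "backups"
backup_dir.mkdir(exist_ok=True)

# Name the archive from the current date and time, then compress work_dir.
stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
archive = shutil.make_archive(str(backup_dir / f"backup_{stamp}"), "gztar", root_dir=work_dir)
print(f"Created {archive}")
```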



Write git commands

Could you please provide me with a step-by-step workflow for using git with GitHub? 
I found a repository on GitHub that I want to build on. 
I would like to work on the material on my local machine and then save it back up to GitHub. 
I would like the workflow to be for the Linux command line. 



Write download and conda commands

I am writing a lot of scripts using Python. I have heard that environment managers such as conda may be useful to me. 
I don't know anything about conda, so can you explain some things?
1. Give me a high-level overview of what environment managers are and what conda is specifically.
2. Please create a step-by-step guide for downloading conda on my machine, and how to use conda to create custom environments. 
3. Please explain and give me the steps to share my environment with colleagues.


Write docker run commands

I would like to run a Docker container that serves a Jupyter notebook. 
Can you please suggest a docker run command that launches the Jupyter notebook 
and mounts a data volume into it. 



Write docker files

I would like to create a Docker image that contains RStudio and 
some customized R code. Can you tell me the steps to 1. make a Dockerfile and 
build the Docker image; and 2. upload the Docker image to Docker Hub.





ChatGPT Awesome Lists

There is an ever-changing meta-list of Awesome lists curated around ChatGPT plugins and extensions.

search: chatgpt+awesome

Check out lists around:

ChatGPT Prompts

ChatGPT Data Science Prompts

API plugins, extensions, & applications


Introduction to Prompt Engineering

Prompt Engineering is the practice of crafting effective instructions for AI large language models. With modern AI-powered tools like Claude Desktop, ChatGPT, Gemini, and NotebookLM offering capabilities to upload documents, search the web, and process multiple file types, mastering prompt engineering has become essential for productive AI interactions.

What You'll Learn

  • Fundamentals: How AI models process and respond to prompts
  • Modern Features: Leveraging document uploads, web search, and multi-modal inputs
  • Best Practices: Structured approaches to writing effective prompts
  • Advanced Techniques: Context management, chaining, and custom instructions

Understanding Modern AI Capabilities

Core Features of Today's AI Tools

Modern AI assistants have evolved beyond simple text chat:

| Feature | Claude | ChatGPT | Gemini | NotebookLM | Copilot |
|---|---|---|---|---|---|
| Document Upload | PDFs, text, code | PDFs, images, data | PDFs, images, GDrive | PDFs, Google Docs | PDFs, OneDrive |
| Web Search | Via MCP | Yes | Yes | Yes | Yes |
| Context Window (tokens) | 200K | 128K | 2M | Document-based | 128K |
| File Analysis | Yes | Yes | Yes | Deep analysis | Yes |
| Code Execution | Yes (MCP) | Yes | Yes | No | Yes |

How AI Models Process Your Input

The Processing Pipeline

  1. Tokenization: Your prompt is broken into smaller units (tokens); a toy sketch follows this list
  2. Context Assembly: Uploaded documents and conversation history are included
  3. Attention Mechanism: The model identifies relevant information
  4. Generation: Response is produced token by token
  5. Formatting: Output is structured according to your specifications
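
As a toy illustration of step 1, the sketch below splits a prompt into word and punctuation tokens. Real tokenizers use learned subword vocabularies (such as byte-pair encoding), so treat this purely as an illustration of the idea.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Runs of word characters, or single punctuation marks.
    # Real tokenizers use learned subword units, not rules like this.
    return re.findall(r"\w+|[^\w\s]", text)

print(toy_tokenize("Summarize this research paper in 3 bullet points"))
# ['Summarize', 'this', 'research', 'paper', 'in', '3', 'bullet', 'points']
```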

Getting Started: Basic Prompt Structure

The Foundation: Clear Instructions

Start with simple, direct prompts before advancing to complex techniques:

# Basic Prompt
"Summarize this research paper in 3 bullet points"
# Better Prompt
"As a research scientist, summarize the key findings from this paper 
in 3 bullet points, focusing on methodology and results"
# Best Prompt
"You are a research scientist reviewing papers for a journal. 
Summarize the attached PDF in 3 bullet points that cover:
1. Research question and hypothesis
2. Methodology and sample size
3. Key findings and limitations
Format as a bullet list with sub-points for clarity."

Working with Documents

Modern AI tools excel at document analysis. Here's how to maximize their potential:

Document Upload Best Practices

  • Specify the document: "In the attached PDF..." or "Based on the uploaded spreadsheet..."
  • Direct attention: "Focus on Section 3.2 of the document"
  • Request specific outputs: "Create a table comparing the methods described in chapters 2 and 5"
  • Combine multiple sources: "Compare the findings in these three papers"

Example: Multi-Document Analysis

I've uploaded three research papers on climate change. Please:

1. Create a comparison table with columns for:
   - Paper title and authors
   - Methodology
   - Key findings
   - Limitations

2. Identify common themes across all papers

3. Highlight any contradictory findings

Format the response with clear headers and use markdown tables.

The CRAFT Framework

For consistent, high-quality results, use the CRAFT framework:

  • Context: Provide background information and set the scene
  • Role: Define who the AI should act as
  • Action: Specify exactly what you want done
  • Format: Describe how the output should be structured
  • Tone: Indicate the style and voice to use

CRAFT Example

Context: I'm preparing a grant proposal for NSF funding on AI in education

Role: Act as an experienced grant writer and education researcher

Action: Review my draft introduction and suggest improvements

Format: Provide feedback as tracked changes with explanations

Tone: Professional, constructive, and encouraging
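
To make CRAFT repeatable, you can keep the five fields in a template and fill them in per task. A minimal Python sketch, using the field values from the example above:

```python
CRAFT_TEMPLATE = """\
Context: {context}
Role: {role}
Action: {action}
Format: {format}
Tone: {tone}"""

prompt = CRAFT_TEMPLATE.format(
    context="I'm preparing a grant proposal for NSF funding on AI in education",
    role="Act as an experienced grant writer and education researcher",
    action="Review my draft introduction and suggest improvements",
    format="Provide feedback as tracked changes with explanations",
    tone="Professional, constructive, and encouraging",
)
print(prompt)  # paste the result into your chatbot of choice
```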

Advanced Techniques

1. Custom Instructions and System Prompts

Modern AI platforms allow you to set persistent instructions:

Platforms like Gemini and Claude let you set "Custom Instructions" or "System Instructions" as standing prompts, which act as global rules for every subsequent prompt in the conversation.

For example:

# Project Context
I'm a data scientist working on machine learning projects.
Always provide Python code examples using scikit-learn and pandas.
Include docstrings and type hints in all code.

# Response Preferences
- Be concise but thorough
- Explain complex concepts with analogies
- Always cite sources when making factual claims
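
At the API level, the same idea is expressed as a "system" message placed ahead of the conversation. A minimal sketch of the message structure used by OpenAI-style chat APIs (the user question is an invented example):

```python
system_instructions = (
    "I'm a data scientist working on machine learning projects. "
    "Always provide Python code examples using scikit-learn and pandas. "
    "Include docstrings and type hints in all code."
)

# The system message is sent once and governs every later turn.
messages = [
    {"role": "system", "content": system_instructions},
    {"role": "user", "content": "Show me how to split a dataset for training."},
]
```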

2. Web Search

Most full-featured chatbots now offer a web-browsing or search-engine capability.

Enabling search allows the chatbot to retrieve and read websites and PDFs while reasoning out its response.

Search for the latest research on the public health benefits of vaccination published in 2024. 

Focus on:
- Papers from top conferences (AHA, ASPPH, NRHA, ICFMDP)
- mRNA
- Bird Flu and COVID

Summarize the top 5 papers with links to the originals.

3. Multi-Modal Prompting

Combine different input types for richer interactions:

I've uploaded:
1. A screenshot of my dashboard
2. The underlying data in CSV format
3. Our brand guidelines PDF

Create a redesigned dashboard that:
- Improves data visualization based on best practices
- Adheres to our brand colors and fonts
- Highlights the KPIs mentioned in the data dictionary

4. Prompt Chaining

Build complex outputs through sequential prompts:

Effective Chaining Strategy

  1. Start broad: "Outline a research paper on sustainable AI"
  2. Zoom in: "Expand section 3 on energy-efficient training methods"
  3. Refine: "Add citations and make the tone more academic"
  4. Polish: "Format according to IEEE standards"
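
In code, chaining amounts to keeping a running conversation history so each step can refine the last. A sketch with a hypothetical ask() helper standing in for whichever chat API you use:

```python
def ask(history: list[dict], prompt: str) -> str:
    """Hypothetical helper: send history plus the new prompt to your
    chat API and return the assistant's reply."""
    history.append({"role": "user", "content": prompt})
    reply = "..."  # replace with a real API call
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = []
for step in [
    "Outline a research paper on sustainable AI",
    "Expand section 3 on energy-efficient training methods",
    "Add citations and make the tone more academic",
    "Format according to IEEE standards",
]:
    print(ask(history, step))
```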

5. Using Examples (Few-Shot Learning)

Provide examples to guide the AI's output:

I need to classify customer feedback. Here are examples:

"The product arrived damaged" → Category: Shipping Issue
"Can't log into my account" → Category: Technical Support
"Love the new features!" → Category: Positive Feedback

Now classify these:
1. "The app keeps crashing on startup"
2. "Best purchase I've made this year"
3. "Package was left in the rain"

Practical Applications

Research and Analysis

Analyze the attached dataset (CSV) and:
1. Identify statistical patterns and outliers
2. Create visualizations for the top 3 insights
3. Write a methods section describing the analysis
4. Suggest additional analyses based on the data

Use pandas profiling techniques and create matplotlib visualizations.
Include code that I can run locally.
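
For reference, a minimal sketch of the kind of starter code such a prompt should elicit (the filename data.csv is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # assumed filename

# Statistical profile plus a simple z-score outlier check.
print(df.describe())
numeric = df.select_dtypes("number")
z = (numeric - numeric.mean()) / numeric.std()
print("Outlier rows:", df[(z.abs() > 3).any(axis=1)].index.tolist())

# One starter visualization: the distribution of each numeric column.
numeric.hist(figsize=(8, 6))
plt.tight_layout()
plt.show()
```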

Writing and Editing

I've uploaded my draft manuscript. Please:

1. Check for consistency in terminology throughout
2. Ensure all figures are referenced in the text
3. Verify the citation format matches APA 7th edition
4. Highlight any unclear passages
5. Suggest improvements for flow between sections

Provide a tracked-changes version and a summary of major edits.

Code Development

Based on the uploaded requirements document:

1. Create a Python class structure for the described system
2. Include comprehensive docstrings and type hints
3. Add unit tests for each method
4. Create a README with installation and usage instructions
5. Follow PEP 8 style guidelines

Use modern Python features (3.10+) and include error handling.

Common Pitfalls and Solutions

Pitfall 1: Vague Instructions

Poor: "Make this better"

Better: "Improve this abstract by making it more concise (under 250 words), adding keywords, and ensuring it follows the journal's structure: background, methods, results, conclusions"

Pitfall 2: Information Overload

Poor: Uploading 50 documents without guidance

Better: "Focus on documents 1-3 which contain the methodology. Ignore the appendices."

Pitfall 3: Assuming Knowledge

Poor: "Fix the usual issues"

Better: "Check for: passive voice, sentences over 25 words, undefined acronyms, and missing Oxford commas"

Pitfall 4: No Output Format

Poor: "Summarize this"

Better: "Create an executive summary with: - 3-sentence overview - 5 key points as bullets - 1 paragraph on implications - Formatted with markdown headers"


Local LLMs vs APIs

Managing API keys
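
Whichever provider you choose, keep keys out of code, notebooks, and git history. A common pattern is to export the key as an environment variable and read it at runtime (the variable name below is a convention, not a requirement):

```python
import os

# Set beforehand in your shell, e.g.: export OPENAI_API_KEY="sk-..."
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set OPENAI_API_KEY before running.")
```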

Extension Installation

  1. Open VS Code.
  2. Navigate to the Extensions view by clicking the icon in the Activity Bar on the side of the window or by pressing Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (macOS).
  3. In the search bar, type "Cline"
  4. Find the official extension from the search results and click Install.
  5. Once installed, you might need to reload VS Code if prompted.

Selecting an API

After installation, you'll typically need to configure an LLM API endpoint and key. Look for settings related to Roo Code or Cline in VS Code's settings (Ctrl+, or Cmd+,).

Google Gemini
  1. Obtain your Google Gemini API key from Google AI Studio or Google Cloud Console.
  2. In VS Code settings, search for "Roo Code Gemini" or a similar setting.
  3. Enter your API key in the designated field (e.g., Roo Code: Gemini API Key).
  4. You might also need to specify the model (e.g., gemini-2.5-pro).
Ollama (for Local Models)

Ollama allows you to run open-source LLMs locally.

  1. Ensure Ollama is installed and running on your machine with the desired models downloaded (e.g., ollama pull gemma3:1b).
  2. In VS Code settings for Roo Code/Cline, look for an option to specify the Ollama API endpoint. This is usually http://localhost:11434 by default.
  3. Select or specify the Ollama model you wish to use (e.g., gemma, qwen). No API key is typically needed for local Ollama usage directly, but the extension must be configured to point to the local server.
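
To confirm the local Ollama server is reachable outside the editor, you can hit its documented REST endpoint directly. A minimal sketch, assuming gemma3:1b has already been pulled:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": "Say hello in one sentence.", "stream": False},
)
print(resp.json()["response"])
```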
OpenAI Compatible

This is for services that adhere to the OpenAI API specification, which can include OpenAI itself or other providers like Azure OpenAI or local LLM servers.

  1. Obtain your API key and API base URL (endpoint) from your provider.
    • For OpenAI: Key from platform.openai.com. Endpoint is typically https://api.openai.com/v1.
    • For Azure OpenAI: Key and endpoint from your Azure deployment.
    • For others: Refer to your provider's documentation.
  2. In VS Code settings for Roo Code/Cline:
    • Enter the API key (e.g., Roo Code: OpenAI API Key).
    • Enter the API base URL if it's different from the default (e.g., Roo Code: OpenAI API Base URL).
    • Select the desired model (e.g., gpt-4o).
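
You can also exercise an OpenAI-compatible endpoint directly to verify the key and base URL before wiring them into the extension. A minimal sketch (model name is a placeholder; use one your provider serves):

```python
import os
import requests

base_url = "https://api.openai.com/v1"  # or your provider's endpoint
resp = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o",  # placeholder model name
        "messages": [{"role": "user", "content": "Reply with one word: ready"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```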
Claude (via API)

If Roo Code/Cline supports direct Claude API integration (distinct from the Claude Desktop app):

  1. Obtain your Anthropic API key from the Anthropic Console.
  2. In VS Code settings for Roo Code/Cline, search for "Roo Code Claude" or a similar setting.
  3. Enter your API key (e.g., Roo Code: Claude API Key).
  4. Specify the Claude model you wish to use (e.g., claude-sonnet-4).

Restart for Changes

After changing API settings, it's often a good idea to restart VS Code or the extension itself if it provides such an option, to ensure the new settings take effect.


Setting up GitHub Copilot on VS Code Locally

GitHub Copilot is deeply integrated into the GitHub ecosystem and VS Code (local).

In GitHub CodeSpaces

  1. Enable Copilot for your account: Ensure you have an active GitHub Copilot subscription associated with your GitHub account.
  2. Launch a CodeSpace: When you create or open a repository in GitHub CodeSpaces, Copilot is often enabled by default if your account has access.
  3. Check Status: Look for the Copilot icon in the status bar at the bottom of the VS Code interface within CodeSpaces. If it's not active, click it to see options or troubleshoot. You might need to authorize it for the specific CodeSpace.

Extension Installation in VS Code (Desktop)

  1. Open VS Code.
  2. Navigate to the Extensions view (Ctrl+Shift+X / Cmd+Shift+X).
  3. Search for "GitHub Copilot".
  4. Find the official extension by GitHub and click Install.
  5. Sign In: After installation, VS Code will prompt you to sign in with your GitHub account. Follow the prompts to authorize VS Code to use GitHub Copilot.
    • If you're not prompted, you can often click the user icon in the bottom left of VS Code and sign in there, or find a "Sign In to GitHub Copilot" command in the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).
  6. Once signed in and with an active subscription, Copilot will be ready to assist you. You'll see its icon in the status bar.

Model Context Protocol (MCP)

Model Context Protocol (MCP) is an open-source standard created by Anthropic for connecting LLMs and other AI systems to external tools and data.

MCP establishes a common protocol (language) for an AI assistant (the "client") to request information or execute actions from an external service (the "server").

MCP actions include reading the contents of files, querying APIs or databases, writing new files or copying data, and executing commands. The protocol defines the structure of these messages.

MCPs offer a powerful framework for enhancing scientific reproducibility.

  • Standardized data access
  • Containerized computational environments
  • Shareable workflows

MCPs for Coding and Commands

A foundational MCP tool is the filesystem server, which gives the LLM the ability to read, write, and execute code on your computer or a remote server (a message-level sketch follows the list below). With it, you can:

  • Ask the LLM to create a new file
  • Request the AI to refactor code or edit text in a file
  • Search through your codebase for relevant functions
  • Execute terminal commands
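
Under the hood, these requests travel as JSON-RPC 2.0 messages. A hedged sketch of what a client's file-read tool call can look like; the tools/call method comes from the MCP spec, but the tool name and arguments vary by server:

```python
import json

# A client asking an MCP filesystem server to read a file.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_file",                   # tool name varies by server
        "arguments": {"path": "src/main.py"},  # illustrative path
    },
}
print(json.dumps(request, indent=2))
```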

Vibe Coding

Vibe coding refers to using an LLM to generate and edit code directly within your IDE (e.g., VS Code). This approach allows for a more fluid and interactive coding experience, where the LLM acts as a collaborative partner.

Allowing an LLM to execute code on your computer may be a violation of institutional security and privacy policy

Coding tools like Cline and Windsurf give you the option to allow 'execution' of code on your machine.

You must understand the implications of giving these LLMs the authority to execute code on your computer and the network it is running upon.

Malicious code lives on the internet, and your Vibing LLM might install it while you're not paying attention

Read more: 📰 Vibe Check: False Packages, A New LLM Security Risk

Vibe Coding Platforms

  • Claude Desktop An easy-to-install desktop platform that connects to Anthropic's powerful LLM API, and allows you to connect to MCP servers.
  • Cursor A popular standalone fork of VS Code, focused on integrating new models with stability and offering a flat-fee pricing model.
  • GitHub Copilot Integrated with VS Code and GitHub CodeSpaces, provides agentic coding with periodic performance fluctuations and tiered pricing.
  • Cline Open-source and model-agnostic, pioneering features like “bring your own model” (BYOM) and operating on a per-request billing structure.
  • Windsurf Offers similar agentic and inline features with tiered pricing and a “just works” usability orientation.

Quick Reference Card

Prompt Engineering Checklist

  • Clear objective: What do you want to achieve?
  • Context provided: Background information included?
  • Role defined: Who should the AI act as?
  • Specific action: Exact task described?
  • Output format: Structure specified?
  • Examples given: For complex tasks?
  • Constraints noted: Length, style, or content limits?
  • Documents referenced: If using uploads?
  • Follow-up planned: For iterative improvement?

Assessment Questions

How do modern AI tools handle uploaded documents?

Answer

Modern AI tools process uploaded documents by:

  • Converting them to text (OCR for images/PDFs)

  • Adding them to the context window

  • Allowing specific references ("In section 2.3...")

  • Enabling cross-document analysis

  • Maintaining document structure awareness

What's the most important element of an effective prompt?

Answer

Clarity of instruction is paramount. The AI needs to understand:

  • What you want done (action)

  • How you want it done (format)

  • Why you want it done (context)

Without clear instructions, even the most advanced AI will produce suboptimal results.

How can you ensure consistent outputs across multiple sessions?

Answer

  1. Use custom instructions (ChatGPT, Claude) or system prompts

  2. Create templates for common tasks

  3. Save successful prompts for reuse

  4. Use platform features like GPTs or Projects

  5. Include examples in your prompts

  6. Specify exact formats with templates

True or False: Longer prompts always produce better results

False

Prompt quality matters more than length. A well-structured, concise prompt often outperforms a lengthy, unfocused one. However, providing sufficient context and clear instructions is important. Aim for:

  • Completeness over brevity

  • Clarity over complexity

  • Structure over stream-of-consciousness

Further Resources