Andromeda Workshop: Project Organization and Structured Data with LLMs

The uv package manager
We use uv rather than conda, which is often used: conda is very mediocre IMO, and you often run into dependency issues. uv is available here.
uv is very good at managing Python packages and their versions such that all packages are compatible with each other.
uv is also just faster and more straightforward than conda. With uv, you set up a Python environment inside your project folder. This is accomplished using three subsequent commands in the terminal: uv init, uv venv, and uv add. After uv init, uv will create the following files with information about your project:

.
├── .gitignore
├── .python-version
├── README.md
├── main.py
└── pyproject.toml
After uv venv, uv will create a virtual environment, allowing you to run Python code independently from all other Python versions on your system.
You activate this environment by running source .venv/bin/activate in the terminal.
To make Positron use it, press CTRL+P, type Interpreter: Discover All Interpreters, and select your virtual environment at the top right of your screen. See here also. Finally, install the packages we need for this workshop:

uv add numpy polars chatlas dotenv
This command will do two things:
First, it installs the packages into your virtual environment. Second, it updates the project files created by uv init, to ensure that the documentation contains the instructions about which packages and which versions have just been installed. numpy and polars are standard data-wrangling libraries, chatlas is a library that allows us to interact easily with LLMs, and dotenv helps us manage API keys.
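Not part of the workshop materials, but a quick sanity check you can run: a tiny script (the name check_env.py is hypothetical) that imports the freshly installed packages from inside the activated environment.

# check_env.py (hypothetical name): confirm that the packages added with
# `uv add` are importable from this project's virtual environment.
import numpy
import polars
import chatlas
import dotenv

print("numpy", numpy.__version__)
print("polars", polars.__version__)

If any of these imports fail, the wrong interpreter is probably selected in Positron.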
There are also ways to create project-specific environments for R packages, but in this presentation, we will just use our global R environment. In R, we need the tidyverse and ellmer packages: tidyverse does the data wrangling, ellmer does the interaction with LLMs.

From the Positron website:
Expert data scientists keep all the files associated with a given project together — input data, scripts, analytical results, and figures. This is such a wise and common practice, in data science and in software development more broadly, that most integrated development environments (IDEs) have built-in support for this practice.
File Organization Principle I
Every data science project you do should be in one folder. Everything you use in this project (reports, data, the programming environment, output, etc.) should be inside that folder, or on the web, but not elsewhere on your system.
Ristovska (2019): Coding for Economists: A Language-Agnostic Guide
Gentzkow and Shapiro (2014): Code and Data for the Social Sciences: A Practitioner’s Guide
McDermott (2020): Data Science for Economists
Hagerty (2020): Advanced Data Analytics in Economics
General principles:
File Organization Principle II
Use separate directories by function.
Separate files into inputs and outputs.
Always use relative filepaths (“../input/data.csv” instead of “C:/build/input/data.csv”); see the sketch after the example directory tree below.
Define your working directory interactively.
An example of such a directory structure (my_project is the folder you open in Positron):

/my_project
    /rawdata
    /code
        /1_clean
        /2_process
        /3_results
    /preppeddata
    /output
        /tables
        /figures
    /temp
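To make the relative-filepath principle concrete for this structure, here is a minimal Python sketch; the file names data.csv and summary.csv are made up for illustration, and it assumes the working directory has been set to /my_project.

# Assumes the working directory is /my_project (set interactively, as advised above).
import polars as pl

# Relative paths keep the project portable across machines and users.
df = pl.read_csv("rawdata/data.csv")            # hypothetical raw input
summary = df.describe()                          # placeholder processing step
summary.write_csv("output/tables/summary.csv")   # hypothetical output table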
Code Organization Principle
Break programs into short scripts or functions, each of which conducts a single task.
Make names distinctive and meaningful.
Be consistent with code style and formatting.
Why? Because code organized and documented like this is far easier to read, debug, and reuse, as the example below illustrates.

Example Code Organization
def calculate_average_word_length(text_input: str) -> float:
    """Calculates the average length of words in a given text.

    This function takes a string of text, splits it into individual words,
    and then computes the average length of those words. Punctuation is
    handled by a separate helper function.

    Args:
        text_input: A string containing the text to be analyzed.

    Returns:
        A float representing the average word length. Returns 0.0 if the
        input string is empty or contains no words.
    """

    # Helper function to remove punctuation from a single word.
    def _clean_word(word: str) -> str:
        punctuation_to_remove = ".,!?;:"
        return ''.join(char for char in word if char not in punctuation_to_remove)

    # Return 0.0 immediately if the input is not a string or is empty.
    if not isinstance(text_input, str) or not text_input:
        return 0.0

    cleaned_words = [_clean_word(word) for word in text_input.split()]

    # Filter out any empty strings that might result from cleaning.
    non_empty_words = [word for word in cleaned_words if word]
    if not non_empty_words:
        return 0.0

    total_length = sum(len(word) for word in non_empty_words)
    word_count = len(non_empty_words)
    return total_length / word_count


# Example of how to run this.
# 1. Define a sample sentence.
sentence = "Programs are meant to be read by humans, and only incidentally for computers to execute."

# 2. Call the function with the sample sentence.
average_length = calculate_average_word_length(sentence)

# 3. Print the result to see the output.
print(f"The average word length is: {average_length:.2f}")
## The average word length is: 4.80

The same example in R:

#' Calculate the Average Length of Words in a Given Text
#'
#' @description
#' This function takes a character string, removes common punctuation, splits the
#' string into individual words, and then computes the average length of those words.
#'
#' @param text_input A character string containing the text to be analyzed.
#'
#' @return A numeric value representing the average word length. Returns 0
#' if the input is invalid, empty, or contains no words after cleaning.
#'
#' @examples
#' sentence <- "Programs are meant to be read by humans, and only incidentally for computers to execute."
#' calculate_avg_word_length(sentence)
#'
calculate_avg_word_length <- function(text_input) {

  # 1. Input Validation
  # Immediately return 0 if the input is not a single, non-empty string.
  if (!is.character(text_input) || length(text_input) != 1 || nchar(text_input) == 0) {
    return(0)
  }

  # 2. Pre-processing: Remove Punctuation
  # The `gsub` function finds and replaces patterns. Here, `[[:punct:]]` is a
  # special class that matches any punctuation character.
  cleaned_text <- gsub(pattern = "[[:punct:]]", replacement = "", x = text_input)

  # 3. Tokenization: Split Text into Words
  # `strsplit` splits the string by whitespace (`\\s+`). It returns a list,
  # so we select the first element `[[1]]` to get a vector of words.
  words <- strsplit(cleaned_text, split = "\\s+")[[1]]

  # 4. Filtering: Remove Empty Strings
  # After splitting, there might be empty strings (e.g., from multiple spaces).
  # We keep only the elements where the number of characters is greater than 0.
  non_empty_words <- words[nchar(words) > 0]

  # 5. Calculation
  # If our vector of words is empty, we can't divide by zero. Return 0.
  if (length(non_empty_words) == 0) {
    return(0)
  }

  # Calculate the sum of the lengths of all words.
  total_length <- sum(nchar(non_empty_words))

  # Count the number of words.
  word_count <- length(non_empty_words)

  # Return the final average.
  return(total_length / word_count)
}
# Example of how to run this in an IDE
# 1. Define a sample sentence.
sentence <- "Programs are meant to be read by humans, and only incidentally for computers to execute."
# 2. Call the function with the sample sentence.
average_length <- calculate_avg_word_length(sentence)
# 3. Print the result to see the output.
# We use `sprintf` for nice formatting, similar to Python's f-string.
cat(sprintf("The average word length is: %.2f\n", average_length))

What if you have to repeat a certain operation a number of times?
Don’t copy-and-paste code. Instead, abstract.
E.g. use functions, loops, or vectorization, as in the sketch below.
Why? Reduces scope for error.
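As an illustration, here is a small Python sketch (the DataFrame and the city_* column names are invented) contrasting the copy-and-paste approach with a single function applied in a loop:

import polars as pl

df = pl.DataFrame({
    "city_a": [1.0, 2.0, 3.0],
    "city_b": [4.0, 5.0, 6.0],
    "city_c": [7.0, 8.0, 9.0],
})

# Copy-and-paste version: three near-identical lines, easy to edit one and forget the others.
# mean_a = df["city_a"].mean()
# mean_b = df["city_b"].mean()
# mean_c = df["city_c"].mean()

# Abstracted version: one function, applied in a loop over all columns.
def column_mean(data: pl.DataFrame, column: str) -> float:
    return data[column].mean()

means = {col: column_mean(df, col) for col in df.columns}
print(means)

Adding a fourth city now requires changing the data, not the code.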
LLMs are highly effective at converting unstructured data into structured formats. Although they aren’t infallible, they automate the heavy lifting of information extraction, drastically cutting down on manual processing. The remainder of this workshop walks through a worked example: extracting character information from book summaries.
We interact with LLMs through the chatlas package (Python) or the ellmer package (R).
In Python, structured output is built on the pydantic library, which we use to define the structure our data should have. pydantic allows you to define a schema: a blueprint for what your structured data should look like.
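As a minimal illustration of the idea (the Book model below is invented for this example and is not part of the workshop code), a schema declares the fields and types you expect, and pydantic validates and coerces raw input against it:

from pydantic import BaseModel

class Book(BaseModel):
    title: str
    year: int

# Raw input (e.g. parsed from an LLM response) is validated against the schema;
# types are coerced where possible, and invalid input raises a clear error.
book = Book.model_validate({"title": "Black Beauty", "year": "1877"})
print(book.year + 1)  # "1877" was coerced to the integer 1877, so this prints 1878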
Both chatlas and ellmer need an API key for the LLM provider. For chatlas, store the key in your project folder in a file called .env (this should be just a text file) containing the line OPENAI_API_KEY="[Key]".
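A minimal sketch of how that key can then be picked up on the Python side, assuming the dotenv package installed earlier provides python-dotenv's load_dotenv:

# Load the key from .env into the environment before creating a chat client.
from dotenv import load_dotenv

load_dotenv()  # reads .env in the working directory and sets OPENAI_API_KEY

ChatOpenAI (used below) then reads the key from the OPENAI_API_KEY environment variable.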
For ellmer, you need to first install the usethis package (should be there on your system already), then run usethis::edit_r_environ() and add the line OPENAI_API_KEY="[ApiKey]".

With the keys in place, we fetch some example data: book metadata from the Gutendex API.

import requests
import polars as pl
import time
def fetch_gutendex_books(num_pages=100):
    base_url = "https://gutendex.com/books/"
    all_books = []

    # We use a session for connection pooling (more efficient for multiple requests)
    with requests.Session() as session:
        print(f"Starting to fetch {num_pages} pages...")

        for page in range(1, num_pages + 1):
            try:
                # Construct the URL with query parameters
                url = f"{base_url}?page={page}&topic=literature"
                response = session.get(url, timeout=10)
                response.raise_for_status()  # Raise error for 404/500 codes

                data = response.json()

                # The actual book data is usually inside the 'results' key
                if 'results' in data:
                    all_books.extend(data['results'])

                print(f"Fetched page {page}/{num_pages}", end='\r')

                # Be polite to the API to avoid being banned
                time.sleep(2)

            except requests.exceptions.RequestException as e:
                print(f"\nError fetching page {page}: {e}")
                # Optional: break or continue depending on desired behavior
                continue

    print(f"\nParsing {len(all_books)} books into Polars DataFrame...")

    # Create DataFrame once from the list of dicts (most efficient method)
    # infer_schema_length=None ensures Polars scans all rows to determine types correctly
    df = pl.DataFrame(all_books, infer_schema_length=None)
    return df
df = fetch_gutendex_books(50)
df_processed = (
    df
    # 0. Explode the summaries
    .explode(['summaries'])
    # 1. Keep only the first summary for each ID
    .unique(subset=["id"], keep="first")
    # 2. Overwrite 'authors' column with just the first author (Struct)
    .with_columns(
        pl.col("authors").list.first()
    )
    # 3. Unnest the struct fields into top-level columns
    # (This will likely create columns like 'name', 'birth_year', 'death_year')
    .unnest("authors")
)

# df_processed.write_parquet('books_processed.parquet')
# df_processed.select(['id', 'title', 'summaries', 'download_count']).write_csv("books_processed.csv", separator="\t")
import polars as pl
import textwrap
print(textwrap.fill(df_processed['summaries'][0], 100))
## "Black Beauty" by Anna Sewell is a novel written in the late 19th century. The story is told from
## the perspective of a horse named Black Beauty, who recounts his experiences growing up on a farm,
## the trials he faces as he is sold into various homes, and the treatment he receives from different
## owners. The narrative touches on themes of animal welfare, kindness to creatures, and the importance
## of humane treatment. At the start of the book, we are introduced to Black Beauty's early life in a
## peaceful meadow, where he lives with his mother, Duchess. He is fondly raised by a kind master and
## learns valuable lessons about good behavior from his mother. As he matures, the story unfolds to
## include his experiences with other horses, the harsh realities of training and harnessing, and the
## contrasting environments in which he lives – some nurturing, and others cruel. The opening chapters
## set the tone for a deeper exploration of social issues regarding the treatment of horses and the
## relationships they develop with humans. (This is an automatically generated summary.)
print(textwrap.fill(df_processed['summaries'][13], 100))
## "A Key to Uncle Tom's Cabin" by Harriet Beecher Stowe is a historical account written in the
## mid-19th century. The book serves as a companion piece to Stowe's famous novel "Uncle Tom's Cabin,"
## providing factual evidence, documents, and corroborative statements to verify the realities of
## slavery depicted in the fictional narrative. It aims to draw attention to the moral and ethical
## implications of slavery, evoking a serious contemplation of a deeply troubling institution. The
## opening of "A Key to Uncle Tom's Cabin" begins with a preface wherein Stowe openly shares her
## struggle in writing this non-fiction work, emphasizing that slavery is an intrinsically dreadful
## subject. She notes that her task has expanded beyond her original intent, driven by the need to
## confront the painful truths surrounding slavery as a moral question. The first chapter focuses on
## various dynamics of the slave trade, illustrated through characters such as Mr. Haley, a slave
## trader, shedding light on the grim realities faced by individuals caught in this trade. Stowe
## underscores that the depictions in "Uncle Tom's Cabin," while fictionalized, are based on real
## events and sentiments, thus legitimizing the emotional and physical toll inflicted upon those
## ensnared in slavery. (This is an automatically generated summary.)

pydantic Scheme

We now define a pydantic scheme that allows us to identify the characteristics of the main character(s) from the summaries:

from pydantic import BaseModel, Field
from typing import List, Literal, Optional
class MainCharacter(BaseModel):
    species: Optional[str] = Field(
        default=None,
        description="The biological species or entity type of the character (e.g., Human, Elf, Robot)."
    )
    gender: Optional[Literal["Male", "Female"]] = Field(
        default=None,
        description="The biological sex or gender identity of the character."
    )
    age: Optional[Literal["Child", "Young Adult", "Adult", "Old Person"]] = Field(
        default=None,
        description="The approximate life stage or age category of the character."
    )
    moral_classification: Optional[Literal["Good", "Neutral", "Evil"]] = Field(
        default=None,
        description="The ethical alignment or moral compass of the character."
    )

class SummaryAnalysis(BaseModel):
    characters: List[MainCharacter] = Field(
        default_factory=list,
        description="A list of main characters found in the summary. Returns an empty list if none are found."
    )

We then send each summary to the model and ask for structured output matching this schema:

import chatlas as ctl
from pydantic import BaseModel, Field
import polars as pl
chat = ctl.ChatOpenAI(model='gpt-4.1')
prompts = [summary for summary in df_processed['summaries']]
out = ctl.batch_chat_structured(chat, prompts, 'state.json', SummaryAnalysis)
# Put the results into a DataFrame
final_df = pl.DataFrame([
    {"id": i, "tmp": [c.model_dump() for c in x.characters] or [None]}
    for i, x in zip(df_processed['id'], out)
]).explode("tmp").unnest("tmp")

# final_df.write_csv("summaries_processed.csv")

The equivalent workflow in R with ellmer:

# R doesn't need a pydantic scheme - it can be defined on the spot here
main_character <- type_object(
  species = type_string(
    "The biological species or entity type of the character (e.g., Human, Elf, Robot).",
    required=FALSE),
  gender = type_enum(c("Male", "Female"),
    "The biological sex or gender identity of the character.",
    required=FALSE),
  age = type_enum(c("Child", "Young Adult", "Adult", "Old Person"),
    "The approximate life stage or age category of the character.",
    required=FALSE),
  moral_classification = type_enum(c("Good", "Neutral", "Evil"),
    "The ethical alignment or moral compass of the character.",
    required=FALSE)
)

list_of_main_characters <- type_array(
  main_character,
  "List of all main characters mentioned in this text.",
  required=FALSE)

data_processed <- read_delim("books_processed.csv")
prompts <- as.list(
  data_processed$summaries
)

chat <- chat_openai(model="gpt-4.1")
out <- parallel_chat_structured(chat, prompts, type = list_of_main_characters)

# Post-processing into data.frame
final_df <- tibble(
  id = data_processed$id,
  data = out
) |>
  unnest(data, keep_empty = TRUE)

write_csv2(final_df, "summaries_processed.csv")

We then construct a few book-level variables from the extracted characters and merge them onto the book data:

summaries <- read_delim('summaries_processed.csv')
books_processed <- read_delim('books_processed.csv')

variables <- summaries |>
  group_by(id) |>
  summarize(
    no_of_protagonists = n(),
    male_ratio = mean(gender == "Male"),
    young_ratio = mean(age == "Young Adult"),
    goodbad = if_else(is.element("Evil", moral_classification), 1, 0))

analysis_df <- books_processed |>
  left_join(variables, by = 'id')
head(analysis_df, 3)

# A tibble: 3 × 8
id title summaries download_count no_of_protagonists male_ratio young_ratio
<dbl> <chr> <chr> <dbl> <int> <dbl> <dbl>
1 271 Blac… "\"Black… 5104 2 0.5 0.5
2 23997 Euge… "\"Eugen… 4794 2 1 1
3 155 The … "\"The M… 3125 2 1 0
# ℹ 1 more variable: goodbad <dbl>
The extracted variables can then be used in a simple regression analysis. The table below reports five specifications: columns (1)–(4) enter each variable separately, column (5) includes them jointly.

| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| (Intercept) | 7.784*** | 7.850*** | 7.837*** | 7.751*** | 7.814*** |
| | (0.021) | (0.067) | (0.027) | (0.022) | (0.074) |
| poly(no_of_protagonists, 2)1 | 3.957*** | | | | 2.619** |
| | (0.801) | | | | (0.997) |
| poly(no_of_protagonists, 2)2 | -2.367*** | | | | -1.662*** |
| | (0.427) | | | | (0.467) |
| male_ratio | | -0.045 | | | -0.017 |
| | | (0.082) | | | (0.086) |
| young_ratio | | | -0.095 | | -0.090 |
| | | | (0.068) | | (0.069) |
| goodbad | | | | 0.252*** | 0.152* |
| | | | | (0.069) | (0.075) |
| Num.Obs. | 1536 | 1369 | 1369 | 1536 | 1369 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
The key tools for this workflow are pydantic and chatlas in Python, and ellmer in R.