Archive for the ‘Play’ Category

LDA from scratch

June 13, 2012

The company where I work has a semi-regular, informal book club (for nerds – which we are). Recently, my supervisor has been giving talks on some papers which describe extensions of the Latent Dirichlet Allocation (LDA) topic modelling method. While I felt like I had a reasonable-enough high-level understanding, I didn’t feel like I knew the specifics well enough. If I had to do something where I’d have to make an Author-Recipient Topic Model [pdf], for example, I’d be lost (unless I found an existing R package, or so). I didn’t feel like I knew enough about the background probability, or about how to make a computer do anything given a model description and a bunch of data. So I started investigating things at this level, and presented what I had. There’s still plenty of gaps and uncertainty in my understanding, but I thought it’d be good to get notes together before I got too distracted by other things… So, the main goal with LDA is that you’re given a collection of documents (i.e., a corpus), and you’d like to determine a collection of topics that are represented in the documents (and in which proportion for each document), as well as the words which comprise each topic. You don’t know the topics ahead of time, and aim to just determine them based on the words you see in the document. There’s other approaches for accomplishing this, besides LDA; the most common seems to be Latent Semantic Analysis (LSA), which is very much like Principal Component Analysis (PCA), relying on the Singular Value Decomposition (SVD) of the so-called “term-document matrix”. But that’s all the topic of another day. Today, we’re sticking with LDA, which is a more probability-based approach. In this post, I’ll build up to LDA, starting with much similar models and questions. It’s been a while since I had a probability class, so I started with basics…

Coins

Suppose you’ve got a weighted coin, which has a probability \phi of coming up heads. You flip the coin N times, and obtain n heads. What can you say about \phi? Well, the probability of obtaining n heads is given by:

\binom{N}{n}\phi^n(1-\phi)^{N-n}.

One way we can try to say something about \phi is to ask which value of \phi maximizes this expression. Calculus teachers, rejoice! Since the binomial coefficient in the front is constant with respect to \phi, we can drop it, and simply try to maximize \phi^n(1-\phi^n)^{N-n}. Many students will, naturally, dive right in and differentiate this with product and chain rules. Students who have seen the tricks before (or are especially perceptive) will realize that the expression is maximized when its log is maximized, since log is monotonically increasing. So we aim to maximize n\log(\phi)+(N-n)\log(1-\phi), for 0<\phi<1. This derivative is easy to calculate, and has a single 0, at n/N, which can be verified to be a local/absolute maximum.

So, that’s nice. And a lot of work to determine what everybody probably already knows. If you’ve got a biased coin that you flip 100 times, and obtain 75 heads and 25 tails, your guess probably should be that the sides are weighted 3 to 1 (i.e., that \phi=.75). Knowing that it’s all random to an extent, though, we aren’t certain of our answer. What we’re really like to know is the probability that \phi is 0.75, versus, say, 0.72 (which could also certainly have generated the observed data).

Well, this is what Bayes’ Theorem is for (or, at least, this is an application of it). We just calculated P(n|\phi), and want to, instead, calculate P(\phi|n). Bayes tells us we can do this via

P(\phi|n) = \dfrac{P(n|\phi)P(\phi)}{P(n)}.

The denominator is the same as \int_\phi P(n|\phi)P(\phi)\ d\phi (I’ll write x‘s below to either reduce or compound notational confusion, depending on your viewpoint). Substituting our expression for P(n|\phi), above, and cancelling the resulting binomial coefficients, we obtain

P(\phi|n) = \dfrac{\phi^n(1-\phi)^{N-n}P(\phi)}{\int_0^1 x^n(1-x)^{N-n}P(x)\ dx}.

But there’s still unknown expressions in there! We still need to know P(\phi), in order to find P(\phi|n). That’s sorta frustrating. We could pick any probability distribution for P(\phi), how to we decide which to pick? Well, an easy choice would be the uniform distribution, because then we can just put 1 in the expression above in a few places, and obtain

P(\phi|n) = \dfrac{\phi^n(1-\phi)^{N-n}}{\int_0^1 x^n(1-x)^{N-n}\ dx}.

The denominator there has no \phis in it, and so is “just” a normalization constant (more on this in a moment). That is, P(\phi|n) is basically given by \phi^n(1-\phi)^{N-n}, once you divide by the integral of that expression over the domain (the interval from 0 to 1) so that you’ve got a probability distribution. The distribution we’ve obtained is called the Beta distribution, with parameters n+1 and N-n+1 (the “plus 1s” might become more sensible in a moment). With our 75 out of 100 example from before, this distribution is as follows:

The Beta(x;76_26) distribution

The mode (not the mean) for this distribution is exactly what we calculated before, 75/100.

Now, something somewhat interesting just happened, without our trying much. We didn’t really have many good guesses for what the distribution of P(\phi), so we just picked the uniform distribution. It turns out that the uniform distribution is a special case of the Beta distribution, namely B(x;1,1). What’s noteworthy, if you’re into that sort of thing, is that we started with a “prior” guess for the distribution of \phi of a certain form, B(x;1,1), and then we updated our beliefs about \phi based on the data (the number of heads and tails flipped), and ended up with a “posterior” distribution for \phi of the same “form”, B(x;1+n,1+N-n). This is sort of a happy accident, if you’re required to do this sort of Bayesian analysis by hand, apparently. It makes it easy to update again, if we flip the coin some more times. Suppose we flip several more times and see H heads and T tails. Well, we’d use our latest beliefs about \phi, namely that it came from B(x;1+n,1+N-n), and then update that information as before. All that happened before was that we added the number of heads to the first parameter of the Beta distribution, and the number of tails to the second parameter, so we’d do that again, obtaining B(x;1+n+H,1+N-n+T). In fact, there’s no reason we’d have to cluster our flips into groups – with each flip we could update our Beta distribution parameters by simply adding 1 to the first parameter for a head, and to the second parameter for a tail. Oh, and the more formal description for what’s going on here, besides “happy accident”, is that the Beta distribution is the conjugate prior for the Binomial distribution.

This also suggests that we didn’t have to start with the uniform distribution, B(x;1,1), to have this happy accident occur. We could start with any B(x;\alpha,\beta), and then after observing H heads and T tails, we’d update our beliefs and say that \phi came from B(x;\alpha+H,\beta+T). So maybe there’s a better choice for those initial parameters, \alpha and \beta? In a very real sense, which we’ve already seen with this updating procedure, these parameters are our prior belief about the coin. If we’ve never flipped the coin at all, we’d sorta want to say that \alpha and \beta are 0, but the parameters of the Beta distribution are required to be positive. So we’d say we don’t really know, and choose a “weekly informative” prior, say 0.001 for both \alpha and \beta. The point of choosing such a small prior is that it lets the data have a strong governance on the shape of the distribution we get once we update our beliefs having seen some data. This is because B(x;\alpha+n,\beta+N-n) is very similar to B(x;n,N-n) if \alpha and \beta are small (relative to n an N-n).

If you’d like to play around with actually running some code (and who doesn’t?), below is some R code I threw together. It simulates a sequence of flips of a weighted coin, and plots the progression of updates to our belief about \phi as we run through the flips.

coinWeights <- c(.65,.35)

prior <- c(1,1)

numFlips <- 10

flips <- sample(1:2, numFlips, prob=coinWeights, replace=TRUE)
print(flips)

betaDist <- function(alphaBeta) {
  function(x) {
    dbeta(x, alphaBeta[1], alphaBeta[2])
  }
}

colors <- rainbow(numFlips)

plot(betaDist(prior), xlim=c(0,1), ylim=c(0,5), xlab="", ylab="")
for (n in 1:numFlips) {
  prior[flips[n]] <- prior[flips[n]] + 1
  func <- betaDist(prior)
  curve(func, from=0, to=1, add=TRUE, col=colors[n])
}

legend("topleft", legend=c("prior",paste("flip",c("H","T")[flips])), col=c("#000000",colors), lwd=1)

I ran it once, obtaining the following plot: So that’s fun. You might play with that code to get a sense of how choosing bad priors will make it take a while (many coin flips) before the posterior distribution matches the actual coin. Before moving on to more complicated models, I’d like to mention that calling the denominator we obtained when we applied Bayes’ Theorem, above, “just” a normalizing constant, above, is a dis-service. The denominator is a function of the parameters \alpha and \beta, as is called the Beta function. Right off the bat, Calculus teachers can again rejoice at the integral determining B(.5,.5):

B(.5,.5)=\int_0^1 \dfrac{dx}{\sqrt{x}\sqrt{1-x}}=\cdots=\pi.

That integral’s doable by hand, and, unfortunately, an example of an exam question Calculus teachers probably would wet their pants over. But, better than simple tricks where goofy looking integrals “happen” to be related to circles (where did \pi come from, there, anyway?), we also have

B(x,y) = \dfrac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)},

where

\Gamma(z) = \int_0^{\infty} e^{-t}t^{z-1}\ dt

is the analytic extension of (z-1)! to the complex plane (our binomial coefficients from before seem to have been wrapped up into the normalizing constant, B(\alpha,\beta)). While I’m sure there’s no shortage of places you could read more about either the Beta or Gamma functions, I must say that I was enthralled with Artin’s book on the Gamma function. Delightful. Ok, moving on.

Dice

Let’s take a moment to slightly alter our notation, making it easy to extend to more dimensions (number of sides on the coin/die). Let’s write \phi_1 and \phi_2 for the probability of obtaining a head or tail on a single flip, respectively. Similarly, let’s write n_1 and n_2 for the number of observed heads and tails, respectively, with N=n_1+n_2 still the total number of flips. Just to double-check if you’re following, our original probability of obtaining this data from the parameter can now be written

P(n|\phi) = \dfrac{N!}{n_1!n_2!}\phi_1^{n_1}\phi_2^{n_2}.

It’s probably now relatively easy to see how to generalize this. Suppose there are V possible outcomes of our “coin” (/die/…), and that the probability of rolling side i is given by \phi_i (letting \phi=(\phi_1,\ldots,\phi_2)). Suppose we roll our die N times, obtaining side i a total of n_i times (and letting n=(n_1,\ldots,n_N). Given just the sequence of rolls (and the number of sides), what can we say about \phi? The multi-dimensional generalization of the Beta distribution is the Dirichlet distribution. It takes parameters \beta=(\beta_1,\ldots,\beta_V) and is defined over the points x=(x_1,\ldots,x_V) having 0<x_i and \sum_i x_i=1 (i.e., x lies on the hyper-triangle (simplex) spanned by the V canonical unit vectors in V-dimensional space). Its density function is

D(x,\beta) = \dfrac{1}{B(\beta)}\prod_i x_i^{\beta_i-1}.

As before, we can treat B(\beta) as “just” a normalization constant. One useful thing about the Dirichlet distribution is that it’s defined over a particular subset of V-dimensional space – namely, the subset where the coordinates sum to 1 and are all positive. So if we draw a sample \phi from D(x;\beta), then \phi=(\phi_1,\ldots,\phi_V) can be thought of a discrete probability distribution defined on a set of size V. More in line with the die example, a sample from D(x;\beta) gives us weights for the sides of our die. The Beta distribution had the nice updating property described above – if we thought the parameters where \alpha and \beta and then we observed n_H heads and n_T tails, our new belief about the weight of the coin is described by parameters with values \alpha+n_H and \beta+n_T. The Dirichlet distribution updates just as nicely: If we previous thought the parameters were \beta (a vector of length V), and then we observed counts n (n_i the number of times we rolled side i), then our new parameters are simply \beta+n. One of my goals, coming in to this learning project, was to actually figure out how to use OpenBUGS to do Bayesian analysis. Instead of doing the algebra by hand above to determine the posterior distributions (update our beliefs after we see some data), we’re supposed to be able to code up some model and provide it, with the data, to OpenBUGS, and have it return something like the answer. More specifically, BUGS, the Bayesian inference Using Gibbs Sampler, runs a Markov Chain Monte Carlo algorithm (presumably Gibbs sampling, based on the acronym), and returns to us draws from the posterior distribution. So even if we don’t know the analytic expression for the posterior, we can let OpenBUGS generate for us as many samples from the posterior as we’d like, and we can then address questions about the posterior distribution (e..g, what is the mean?) by way of the samples. I’d love, of course, to dig in to the mechanics of all this, but in the process of preparing my talk I found that another guy at work was already a few steps ahead of me down that road. So I’m going to let him tell me about it, which I very much look forward to. For now, my goal will simply be to figure out how to get OpenBUGS to answer the same question we’ve already done analytically in the case of a coin (V=2): Given the sequence of rolls, what can it tell me about \phi? To start, let’s return to our apparently simple die-rolling procedure. What we have is a “generative model” for producing a data set. Specifically, we run through the following steps:

  1. Choose a distribution of weights for sides of our die, by picking \phi from D(\beta).
  2. For each of N iterations,
    1. Roll the die, recording the side, w_i, a sample from the categorical distribution Cat(\phi).

Having written what’s going on in this manner, we’re basically right at a description of the model in the BUGS language. What we’ve just written would get translated as:

model {
  phi[1:V] ~ ddirich(beta[])
  for (n in 1:N) {
    w[n] ~ dcat(phi[])
  }
}

(Your mileage may vary with ddirich versus ddirch.) However, this model isn’t quite complete enough. We haven’t said anything about \beta in this model, but it certainly isn’t data we’ll be feeding in to the model (that’d be w, the sequence of rolls). This is where I fall back on “wave the wand” – there’s still a fair amount of magic, for me, built into the priors when working with Bayesian inference. I’ll need to learn more about this, but for now, I’ll tell BUGS to use weakly informative priors (like we did with the coin). The complete model, then, is described to BUGS as:

model {
  phi[1:V] ~ ddirich(beta[])
  for (n in 1:N) {
    w[n] ~ dcat(phi[])
  }
  for (v in 1:V) {
    beta[v] <- 0.001
  }
}

But how does BUGS know what “V” and “N” and “w[n]” are? Well, we have to feed it the data. For this, I’ll let the R package rbugs do the hard work. I’ll work with relatively familiar R objects (which still surprise and frustrate me regularly), and let rbugs translate to and from OpenBUGS for me. The model above needs to be saved in a file, say ‘die.txt’. We can then use the following R code to play with this model and generate some pictures for us:

library(rbugs) # talk to OpenBUGS
library(coda) # MCMC utilities (mostly unused below)
library(gtools) # gives us ddirichlet

# set up parameters
# args <- strsplit("100 c(.7,.3) 1 5000 1000 .001", " ")[[1]]
args <- commandArgs(trailingOnly = TRUE)

# die parameters
n.rolls <- as.integer(args[1]) # 100
pre.phi <- eval(parse(text=args[2])) # c(.7,.3)
n.sides <- length(pre.phi) # 2 (calculated)

# mcmc parameters
n.chains <- as.integer(args[3]) # 1
n.iter <- as.integer(args[4]) # 5000
n.burnin <- as.integer(args[5]) # 1000
default.beta <- as.numeric(args[6]) # .001
# generate data
rolls <- sample(1:n.sides, n.rolls, prob = pre.phi, replace=TRUE)

# precalculate right answers
post.beta <- default.beta + tabulate(rolls)
post.phi.marginal <- cbind(post.beta, sum(post.beta)-post.beta) # phi[s] is Beta(beta[s]-1, sum(beta)-2)
post.phi.mean <- post.beta / sum(post.beta)
post.phi.mode <- (post.beta-1) / (sum(post.beta)-2)
post.phi.max <- apply(cbind(post.phi.mode, post.phi.marginal), 1, function(r) { dbeta(r[1], r[2], r[3]) })
# set up data to pass to bugs
die.data <- list(
 V = n.sides,
 N = n.rolls,
 w = rolls
 )
# generate initial values. rbugs doesn't seem to handle empty initial values well
inits <- function() {
 list(beta = rep(default.beta, n.sides))
}
# parameters to be saved, to compare with our algebraic calculations above
parameters <- c("phi")

# give us some samples! (make sure you have a directory 'tmp' under the pwd)
die.sim <- rbugs(die.data, inits, parameters, "die.txt",
 n.chains=n.chains, n.iter=n.iter, n.burnin=n.burnin,
 bugsWorkingDir="tmp", bugs="/usr/local/bin/OpenBUGS", cleanBugsWorkingDir=TRUE)
# coda provides some utilities for if you know enough about mcmc, which i don't
die.mcmc <- rbugs2coda(die.sim)
# die.mcmc is a list, with length = n.chains
# die.mcmc[[n]] has dimension c((n.iter-n.burnin), # variables)

base.filename = paste(args, collapse="_")

# save some stuff, in case you feel like poking at it
save(die.mcmc, args, rolls, file=paste(base.filename, 'RData', sep='.'))

# set up some drawing utilities
colors <- rainbow(n.sides)
xlim <- c(-.05, max(c(pre.phi, max(post.phi.mean)))+.1)
ylim <- c(-.05, 1.05)

draw.mean.line <- function(x, y, col, lty) {
 lines(c(x,x), ylim, col=col, lty=lty)
 text(x, y, col=col, labels=sprintf("%.03f", x))
}

draw.samples.density <- function(chains, varname, col, lty) {
 samples <- chains[[1]][,varname]
 den <- density(samples)
 den$y <- den$y / max(den$y)
 lines(den, col=col, lty=lty)
 draw.mean.line(mean(samples), -.03, col, lty)
}

# draw the sample distributions, as well as the given distrbution
png(paste(base.filename, 'png', sep='.'), width=1000, height=800)
plot(NULL, xlim=xlim, ylim=ylim, xlab=NA, ylab=NA, main=base.filename)

for (idx in 1:n.sides) {
 # actual
 draw.samples.density(die.mcmc, paste("phi[",idx,"]", sep=""), colors[idx], 1)

# show the original chosen values for phi
 draw.mean.line(pre.phi[idx], 1.03, colors[idx], 4)

# plot the algebraic answer, scaled to have max 1
 curve(dbeta(x, post.phi.marginal[idx,1], post.phi.marginal[idx,2]) / post.phi.max[idx], lty=2, add=TRUE, from=0, to=xlim[2])
 text(post.phi.mean[idx], .5+.1*(n.sides/2)-.1*(idx), labels=sprintf("%.03f",post.phi.mean[idx]))
}
legend("topleft", col=colors, lty=1, legend=paste("[",1:n.sides,"]",sep=""), title="phi")

dev.off()

For grins, here’s the output of one of my runs: Sample run, learning the weights of the sides of a weighted die Can you work out how many rolls there were for each of the sides? Having a script like this makes it pretty easy to investigate some questions:

  1. What happens if I have more rolls (N)? Fewer?
  2. If I pick a bad prior (e.g., a \beta with large values), what happens to the output? How can it be compensated for? To what extent?
  3. What about those iterations and burnin parameters? And just how does BUGS give us these samples anyway? Another day…

Basically there’s a bunch of knobs to twiddle, and all sorts of fun can be had. But let’s see if we can actually get closer to LDA, shall we?

Hierarchical Dice

Let’s add a layer of indirection. Suppose we’ve got a collection of weighted dice (K of them), each with the same number of sides (still V), but each possibly weighted differently. Let’s also suppose we’ve got a separate weighted die with K faces. Let’s call the die with K sides the T-die, and the other dice will be the W-dice. Now instead of just rolling a single die repeatedly, we’ll repeatedly (still N times) do the following:

  1. Roll the T-die to decide which of the W-dice to roll.
  2. Roll the appropriate W-die, and record the result.

Given the sequence of rolls (where a ‘1’ on one W-die is not distinguished from a ‘1’ on another W-die), along with K,V,N, can we determine the weights of the T-die and W-dice? Naturally, we should expect that this will go better when N is relatively large. Perhaps someday I’ll come back and fill in the algebra here, but for now let’s just see what we need to do in order to get OpenBUGS to help us out again. First off, we don’t know the weights of the T-die, and we want it to help us learn those. Let’s say they are \theta_1,\ldots,\theta_K, or more concisely the vector \theta. As before, this is a discrete probability distribution, and we’ve seen that a sample from a Dirichlet gives us such a beast. So it seems sensible to suppose that \theta is chosen from some Dir(x;\alpha), for some hyper-parameter \alpha (a K-dimensional vector). As before, we’ll feed in weakly informative priors. Next up, we have the weights of the sides on each of the W-dice. Before, we only had one such die, and the weights of its sides were \phi. So let’s stick with \phi for these weights, but now it must become a matrix – specifically, it has to be a K\times V matrix. Each row in this matrix is, again, a probability distribution, so we’ll assume \phi_{i,\bullet} is chosen from some Dir(x;\beta), where \beta is a single V-dimensional vector. All together, then, the model looks like:

model {
  theta[1:K] ~ ddirich(alpha[])
  for (k in 1:K) {
    phi[k,1:V] ~ ddirich(beta[])
  }
  for (n in 1:N) {
    z[n] ~ dcat(theta[])
    w[n] ~ dcat(phi[z[n],])
  }
  for (k in 1:K) {
    alpha[k] <- .1
  }
  for (v in 1:V) {
   beta[v] <- .001
  }
}

There’s enough parameters around that I haven’t bothered trying to make a nice clear picture of the outputs of this. The changes necessary to simply run the thing, beginning with the R code above, are pretty minimal – take in a couple extra parameters, generate samples via the two-step process, and figure out what you’d like to do with the outputs, and you’re set. Of course, I haven’t worked out the algebraically correct answers here… But we’re now knocking on the door of

LDA

Instead of only having a single T-die, suppose you’ve got several (M, say). After your N rolls using your first T-die, and their tabulation to n, from before, you decide to repeat the process with the second T-die (roll it N times, each time rolling the appropriate W-die and recording the answer), and then go through it all again with the third T-die, and so on. Each time you use the same set of W-dice… Ok, the dice viewpoint has gotten stretched far too thin. Who would have that many weighted dice, and be interested in all those rolls anyway? Let’s translate over to document collections. Suppose you’ve got a collection of M documents (a “corpus”, in the parlance of our time), each with N words (for notational convenience, there’s no reason they all really need to have the same length). Suppose, moreover, that each document is actually a blend of any of K topics, and that each document pertains to each topic in different proportions (weights of your T-dice). Finally, suppose each topic is represented by any of V words (finally, V for “vocabulary”!), and that the different topics have different weights for the V words (your W-dice all had potentially different weights for their sides, but all had the same number of sides). Given only the word sequences for the documents, the job of LDA is to uncover the topic distribution for each document (the weights of the sides of the associated T-die) as well as the word frequency distributions corresponding to each topic (the weights of the various W-dice). But really, this is only a step or two harder than uncovering the weights of a biased coin, which we began with. There’s sort of an assumption floating around in here that the order of words in documents doesn’t much matter. The brief look I had the original LDA paper [pdf] has me thinking that this has nice theoretical consequences concerning the form of the model for generating documents. Basically, under our assumption that documents are generated by repeatedly picking a topic (rolling a T-die) and then picking a word based on the word weights for that topic (rolling the appropriate W-die), the parameters have to be as described as in the following BUGS model:

model {
  for (k in 1:K) {
    phi[k,1:V] ~ ddirich(beta[]) # weights of W-die k
  }
  for (d in 1:M) {
    theta[d,1:K] ~ ddirich(alpha[]) # weights of T-die d
    for (n in 1:N) {
      z[d,n] ~ dcat(theta[d,])    # the n-th T-die roll for document d (choose a topic)
      w[d,n] ~ dcat(phi[z[d,n],]) # the n-th W-die roll for document d (choose a word from topic z[d,n])
    }
  }
}

I took the priors out, above; mostly I still feel like they’re a gap in my understanding. But whatever, good enough. It’s natural at this point to wonder how you’d take the word order into account anyway; I haven’t looked into it much, but there’s at least one paper [pdf] on the subject. One might also wonder how to to determine the number of topics if that constant (K) isn’t given to you with the word count tabulations. Again, this has been addressed [pdf], though I haven’t dug into it a whole lot. And let’s not forget, let’s not forget, that I still have basically no idea how or why Gibbs sampling (and/or other MCMC methods – Metropolis-Hastings, I’m looking at you) works. Finally, I was honestly a little surprised to find that the OpenBUGS model above, and the R script I coded up to use it, took something like 8 hours on sample data I generated to throw at it (maybe 1000 words, 3 documents, 2 topics, or so). I then looked at the lda package in R, manipulated my sample data so I could feed it to that, and got answers out in about a second. Clearly there’s something else going on there. My investigations so far have me thinking the difference is probably described in this paper, but it, along with everything else I’ve just mentioned, is still in the “todo” pile. So, more questions than answers, though I feel like I answered a few questions for myself in all of this. It’s probably some sort of life lesson that learning one thing just opens up the doors for not knowing a whole bunch more, or something. Hopefully I’ll keep chipping away on these or related topics (or, at least, something interesting). And maybe I’ll just find the energy to write about it, to help point out all the things I don’t understand. Anyway, thanks for coming with me on this. Sorry it took me so long, and left so many questions. In addition to the links above, I found this paper helpful/interesting.

Update 2015071: Corrected variable of integration in formula for $\Gamma(z)$, thanks to comment by Joel Nothman.

Advertisements

Printing in foreach’s dopar

November 5, 2011

I’ve recently been doing some work in R, and using the foreach package for some easy parallelization. I’d like to be able to spit out logging messages throughout my code, but the print lines always disappear when they are in the code executed by the %dopar% operator. So I’ve been working on trying to come up with a nice way to still capture/print those logging messages without having to change much of my existing code. I think my ultimate goal would be to write another binary operator which I can use as a drop-in replacement for %dopar%, which uses dopar under the covers, and which gets my print lines to the console. I haven’t quite gotten to that point yet, but I might be close-ish, and, either way, what I’ve got seems like it could either be useful for others, or something others could suggest how to fix.

My Setup

Following this post on closures in R, I set up most of my code as a bunch of ‘objects’ (environments). For example, I construct a logger object by calling the following function:

get.logger <- function() {
  logger <- new.env()
  logger$message <- function(str) {
    print(paste(date(), str))
  }
  environment(logger$message) <- logger
  logger
}

In sort of my ‘main’ code, I then do something like logger <- get.logger(), and then pass this logger around to my other objects which do more heavy lifting. For example, I might have a function:

guy.using.logger <- function(lgr) {
  guy <- new.env()
  guy$.logger <- lgr
  guy$do.stuff <- function() {
    foreach(i=1:10,
            .inorder=TRUE,
            .export='.logger',
            .init=c(),
            .combine=function(left, right) { c(left, right) }) %dopar% {
      .logger$message(paste('Working on piece', i))
      i^2
    }
  }

  environment(guy$do.stuff) <- guy

  guy
}

And then I’d have something like the following in my main code (in addition to setting up the parallel backend and such):

logger <- get.logger()
gul <- guy.using.logger(logger)
gul$do.stuff()

That final line return a vector with 10 elements, the squares of the integers 1 through 10, as expected. However, the thing I’m trying to overcome is that none of the expected print lines show up on the console.

Misguided Quest

I admit that this may be somewhat of a misguided quest. Changing the %dopar% to simply %do%, all of the print lines show up. And maybe I shouldn’t be doing objects the way I have them. I like to think an argument could be made for still trying to make my logging work within %dopar%. I’m not going to try to make an argument though. If you think it’s a useless mission, go find something else to read – there’s plenty of other interesting things online. Or stick around anyway, perhaps you’ll find something interesting here, or be able to help understand whatever it is I’m missing to make it work the way I want.

Buffering Messages

The basic idea I’ve been working with is to replace the logger with one that doesn’t actually print lines, but stores them up so they can be retrieved and printed later. So, for starters, I made a buffering logger:

get.buffering.logger <- function() {
  logger <- new.env()
  logger$.log <- c()
  logger$message <- function(str) {
    .log <<- c(.log, paste(date(), str))
  }
  logger$get.log <- function() {
    rv <- .log
    .log <<- c()
    rv
  }
  environment(logger$message) <- logger
  environment(logger$get.log) <- logger
  logger
}

Now, if I just make logger <- get.buffering.logger() and pass that to guy.using.logger, I can’t successfully ask for the messages after a call to do.stuff(). My guess is that the issue is the lack of thread-safety with vector/c(). This isn’t too upsetting, because I don’t really want to be buffering my messages all that long anyway. If I return the logger’s buffered messages as part of the results at the end of the expression evaluated in %dopar%, then in the .combine of the foreach, I could print those messages there. More explicitly, I could do:

complicated.guy.using.logger <- function(lgr) {
  guy <- new.env()
  guy$.logger <- lgr
  guy$do.stuff <- function() {
    combiner <- function(left, right) {
      cat(right$msgs, sep='\n')
      c(left, right$ans)
    }
    foreach(i=1:10,
            .inorder=TRUE,
            .export='.logger',
            .init=c(),
            .combine=combiner) %dopar% {
      .logger$message(paste('Working on piece', i))
      list(ans=i^2, msgs=.logger$get.log())
    }
  }

  environment(guy$do.stuff) <- guy

  guy
}

logger <- get.buffering.logger()
cgul <- complicated.guy.using.logger(logger)
cgul$do.stuff()

And I get all my print lines, and the same return value from the %dopar% call (the vector of the first ten squares) as before. Admittedly, I’m a little surprised that this seems to work, now that I think about it. The shared logger is buffering messages for several threads, which seems sort of troublesome. I guess they’re kept separate enough by the threads, or I’ve just gotten lucky the twice I’ve run the code above.

Jazzing Things Up

If I only had one or two foreach calls in my code, I’d probably just fix them as above and be content enough (if the last thread-safety concerns calmed down). However, there’s clearly room for some automation here. What has to happen?

  1. The expression evaluated by %dopar% needs to return whatever it did before, as well as the accumulated messages during the execution.
  2. The combine function needs to return what it did before, and print out the logging messages.

The easy part, from my perspective, is the second part. Let’s assume we can solve number 1, as we did in the previous section. That is, the new block will return a list, with an ‘ans’ element which has whatever was returned previously, and a ‘msgs’ element. Recall that foreach() actually returns an object of class foreach. Using summary() to get a sense of that object, it seems easy enough to write a function which takes a foreach object and returns a new foreach object which is basically the same as the original, but has a modified .combine function. Let’s try

jazz.fe <- function(fe) {
  r.fe <- fe
  r.fe$combineInfo$fun <- function(left, right) {
    cat(right$msgs, sep='\n')
    fe$combineInfo$fun(left, right$ans)
  }
  r.fe
}

We need a similar function which modifies the expression to be evaluated (the block of code on the right of %dopar%), so that the return values are manipulated as described above. I messed around a little bit with passing unevaluated expressions around, and came up with the following:

jazz.ex <- function(ex) {
  parse(text=c('r<-', deparse(ex), 'list(msgs=.logger$get.log(), ans=r)'))
}

And then these bits can be used as in the following guy:

jazzed.guy.using.logger <- function(lgr) {
  guy <- new.env()
  guy$.logger <- lgr
  guy$do.stuff <- function() {
    jazz.fe(foreach(i=1:10,
            .inorder=TRUE,
            .export='.logger',
            .init=c(),
            .combine=function(left, right) { c(left, right) })) %dopar% eval(jazz.ex(quote({
      .logger$message(paste('Working on piece', i))
      i^2
    })))
  }

  environment(guy$do.stuff) <- guy

  guy
}

This code should be compared with the very first guy.using.logger. The only difference between the two is that we wrapped the foreach in a function call, and also wrapped the expression in… a few calls. The ultimate goal of a drop-in %dopar% replacement is tantalizingly close. If all I need to do is call some function on the foreach object, and some other function on the expression, and then I can run %dopar%, that’s easy:

'%jdp%' <- function(fe, ex) {
  jazz.fe(fe) %dopar% eval(jazz.ex(ex))
}

And then I could just do

failing.ultimate.guy.using.logger <- function(lgr) {
  guy <- new.env()
  guy$.logger <- lgr
  guy$do.stuff <- function() {
    foreach(i=1:10,
            .inorder=TRUE,
            .export='.logger',
            .init=c(),
            .combine=function(left, right) { c(left, right) }) %jdp% quote({
      .logger$message(paste('Working on piece', i))
      i^2
    })
  }

  environment(guy$do.stuff) <- guy

  guy
}

logger <- get.buffering.logger()
fgul <- failing.ultimate.guy.using.logger(logger)
fgul$do.stuff()

No dice:

> fgul$do.stuff()
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : 
  unable to find variable ".logger"

Rats. I haven’t yet found the right combination of quote, eval, substitute, … to make this work, and I’ve tried several. I’ve been reading about environments and frames and lexical scoping and closures, and still haven’t gotten it right. If I put the definition of %jdp% in the do.stuff function, it actually works. But that’s ugly. Nothing else I’ve come up with works and involves as few changes as wrapping the two arguments to the %dopar% operator, as in the jazzed.guy.using.logger above.

So, if anybody’s got any suggestions, just let me know. In the mean time, I’ll either poke around in the %dopar% source for inspiration, or move on to other things.

A Math Prezi

November 21, 2010

Recently I had to give a math talk about my work. Previous talks I’ve given I just did on the chalkboard, but this being my last math talk for a while, I thought I might finally try Beamer (nice quickstart). And then I realized I should try Prezi. I thought I’d share my exploration with you, in case you try to decide about something similar. If you want to just cut to the chase and see the final Beamer work (pdf or source tarball), or the final Prezi, go for it. If for some strange reason you actually care about the math, you’re welcome to read the paper (pdf or source tarball) my presentation was based on. My final recommendation: stick with Beamer (until Prezi starts handling TeX).

I started off making a Prezi just to see how it worked, how easy it was to use and such. It’s pretty simple to use, which is nice. Of course, as one comes to expect, it doesn’t handle LaTeX. Ok, so, no worries, I’ll just do some TeX to make pictures I need, and insert pictures. Prezi will actually let you put in pdfs. So what you could do is make a beamer presentation, with a very basic template and no titles, and then use “pdftk” to “burst” the pdf, making a single pdf for each frame, and then use ImageMagick‘s “convert” to change the file type, do some cropping and trimming and things (if desired), and then load all those little pictures where you want them. You probably won’t want to use the actual pictures as path steps in your prezi, because it’ll zoom all the way in, but that’s ok, because you can just wrap a hidden frame where you want it.

So I then made another prezi, putting text where I wanted, thinking I’d then go in and make lots of little pictures, and put them where I wanted. This prezi did lots more zooming and twisting, so it was sorta fun, but I did worry a little if it might turn some people off (or make them motion sick :)). And then I realized I didn’t really want to make all of those little pictures, that the fonts wouldn’t match, and that I’d probably still have resolution issues (I couldn’t find an easy way to make, say, svgs, from tex).

Ok, so, maybe I could cheat a little bit. You can get away with a lot if you just use italics for math. I thought maybe I could use this for most of the math, and then rely on fewer pictures to have to insert. Sadly, prezi won’t do italics (or underline, or bold!). That was fairly surprising to me. No LaTeX I basically expect, but no italics? Well, ok, maybe I can cheat another way. Surely there’s Unicode characters for most of what I want, I could just type those in. But no, prezi (I’d happily blame flash here, I don’t know what wasn’t really working) wouldn’t do that either. I’d type unicode characters, and nothing would show up. I’d copy a unicode character typed elsewhere, and try to paste it in, and nothing. Sigh.

I pretty much gave up at that point, and made a beamer presentation. But the prezi urge just wouldn’t die. I decided that if I took whole frames from my beamer presentation, and added those to my prezi, I would (a) have consistent fonts, (b) wouldn’t have lots of tiny pictures to upload, and (c) probably wouldn’t do as much twisting and spinning and zooming, and would maybe, then, end up with a better presentation. One could pdftk burst and convert like I mentioned before, but I think I was having some issues getting good resolution that way (looking back, I question this, so you may want to play around). So I decided I could take screenshots of every frame, when it was in full-screen mode, and use those as my pictures. ImageMagick’s “import” takes screenshots, and with the ‘window -root’ option, it grabs full-screen. I don’t know how to force xpdf to turn the page from outside the program, so I set up a quick little bash script that would beep, sleep 2 seconds, and then import a screenshot. Switch workspace to my full-screened xpdf (put ‘xpdf*FullScreenMatteColor: white’ in .Xdefaults and do ‘xrdb -merge .Xdefaults’ before running xpdf), and just press the spacebar after every beep. Badabing. 2-3 minutes later, and I’ve got a 1280×800 image of every frame from my presentation. Upload to prezi, twist, zoom, and you’re done.

Except, no. Prezi has the dis-fortune of having to work on any screen resolution. I don’t know what they’re magic zoomer does to decide how to zoom, but things don’t go great if you try to present my prezi fullscreen at a different resolution. And, unfortunately, I ended up in a room with a computer whose screen was at a different resolution, and that I wasn’t allowed to change. So I fell back on my beamer talk. :-/ People said it was good anyway.

According to the support forums and associated prezi, I maybe should have been able to figure this out. Perhaps converting to pngs was my downfall. I really thought I tried keeping things as pdfs. I’ve been wrong before. Oh well, it’s over now. And I did learn other fun stuff with all this fiddling.

While I was doing all this, I finally figured out how to use pstricks to do text in a circle (or along other paths). I think I’ve tried before, but never quite figured out what was going on with \PSforPDF, even if I was able to put text on a path. But this time I got it, thanks to this presentation [pdf] (which I probably looked at before, too). If you’re working on project.tex, put all the pstricks stuff in \PSforPDF blocks, run latex, dvips, and ps2pdf, eventually outputting project-pics.pdf. Then when you run pdflatex project.tex, since you’re doing pdflatex instead of latex, \PSforPDF probably expands to some sort of \includegraphics[project-pics], and imports the *-pics.pdf (making that -pics assumption about the filename) you just made. Good stuff. LaTeX will probably be one of the things I miss the most about getting out of mathematics in academia.

Palindromes in Python

October 23, 2010

So you may have noticed my lack of math posts recently. Things change, you know. I might still have a few left in me, at some point. Either way, this blog may start having more programming. I’m putting this one here instead of at Euler’s Circus, only because it’s not a Project Euler problem. Getting on toward time to consolidate my blogging… maybe when I’m done with grad school (soon!).

Anyway, right. So here’s a programming post. I’m not sure how I came across the Programming Praxis blog, but one of their recent posts caught my eye: find the longest palindrome in a string. Given a string, what is the longest palindrome contained in that string? I thought about it (actually thought about it, instead of just thinking I would like to think about it) for a few minutes this evening, and came up with a little something that pleased me. Python‘ll do that.

So the idea is to build an array whose n-th item is an array of all of the indices in the string where a palindrome of length n begins. I guess that means I think of this array as 1-based, instead of 0-based, but whatever. The reason I like this idea is that if you’ve found all of the palindromes of length less than n, you can easily pick out the indices where palindromes of length n start as those indices, i, where (1) the string has the same character at i and i+n-1, and (2) i+1 is a palindrome of length n-2. The other reason I like this idea is that it’s really easy in python:

# assume text is a string of only lowercase letters, nothing else
text = "astringiwouldreallyliketofindlongpalindromesin"
tlen = len(text) # convenience, it's a few characters shorter to type
# keep track of where palindromes of all lengths are
# longs[n][m] = 1 if there is an (n+1)-digit pali beginning at index m
# and longs[n][m] is undefined otherwise
# prep longs with the easy cases, 1- and 2-digit palindromes
longs = [ [n for n in xrange(0,tlen)],
          [n for n in xrange(0,tlen-1) if text[n] == text[n+1]] ]

# iterate
while len(longs[-1]) > 0 or len(longs[-2]) > 0:
    curlen = len(longs)
    longs.append([n for n in xrange(0, tlen-curlen)
                  if text[n] == text[n+curlen]
                  and n+1 in longs[-2]])

winners = longs[-3] # [-1] and [-2] are empty, after all
winlen = len(longs)-2
print "\n".join([text[n:n+winlen] for n in winners])

So, there’s that. All of the interesting work gets done in that one line in the loop. You gotta be a little bit careful, watching for off-by-ones, but this seems to work. I thought about trying some other algorithms to compare efficiency. But I like this one, even if I’d eventually find it isn’t the best in terms of efficiency. I had another little bit in there that I eventually realized was unnecessary, but I still thought was fun:

# build a dictionary, character: array, called locs, with
# locs[c] = [locations of occurences of 'c' in text]
locs = dict([(c,[]) for c in string.lowercase])
map(lambda nc:locs[nc[1]].append(nc[0]), enumerate(text))

There may be a built-in python-y way to do this, but I like this anyway. I guess I’m just a sucker for map. And list comprehension.

The Steinhaus Conjecture

December 23, 2009

Or, perhaps more appropriately, “A” Steinhaus conjecture, he/she (I’m guessing Hugo, so he. Perhaps I’ll look into it) seems to have made a couple. This conjecture (theorem) also goes by the name “The Three Gap Theorem”, or “The Three Distance Theorem”. Which is all a little annoying, I think. It makes looking for references 3 times as hard, I reckon. But it’s a pretty cool result, and I’m glad Dave Richeson brought it to my attention via his blog post on Three cool facts about rotations of the circle.

To write down the theorem, I’ll first introduce the notation \{x\} for the “decimal part” of a real number, defined as x-\lfloor x\rfloor, \lfloor x\rfloor being the largest integer no bigger than x. Since I’ll be thinking about positive x, it is the value of x if you ignore digits to the left of the decimal point. This seems to be fairly common notation. Anyway…

The theorem goes something like this:

Theorem: Suppose that 0<\alpha<1 is irrational. Let N be a positive integer bigger than 1, and consider the N points \{m\alpha\} for 0\leq m <N. These points partition the interval [0,1] into N subintervals. If the distances of these subintervals are calculated, there will be either 2 or 3 distinct distances.

The circle comes in by thinking of the interval [0,1] as a circle with circumference 1. To help visualize it, Dr. Richeson made a pretty sweet GeoGebra applet.

I think this is a pretty initially surprising theorem. My initial shock has worn off just slightly, now that I’ve played with pictures and dug through a proof, but it’s still a wonderful result. I mean, irrational values are supposed to do weird things, right? Their multiples should bounce all over the place in the unit interval. And yet, they partition the circle into just 2 or 3 differently-sized gaps? Crazy talk. Also, the theorem as stated above isn’t as strong as it could be… you can say a bit more about the distances. I think I’ll talk more about it in another post.

I started reading about this theorem, after Dr. Richeson’s post, in the paper by Tony van Ravenstein. As I was reading the proof I got hung up on some details, and found that consulting the paper by Micaela Mayero got me over those difficulties. The paper by Mayero is essentially a formal proof written for the Coq formal proof system, so it sort of makes sense that details will be pretty fully explained in it. Either way though, it’s really not a long or particularly difficult proof (you mostly play with some inequalities).

I may return, in a future post, to talking about the proof, and I’ll certainly come back and tell you as I read more about further consequences and generalizations, and whatever else I find in some other papers I’m planning on looking at. But for now, let me mention a result in van Ravenstein’s paper. He proves that in going from the picture with N points to the picture of N+1 points, the N+1-th point will break up the oldest of the intervals with the largest length. The “age” of an interval is pretty intuitive. If a particular interval, say between multiples \{p\alpha\} and \{q\alpha\} comes in when there are N_0 points, and those two points are still neighbors when there are N_1 points, then the age of that interval, at stage N_1, is N_1-N_0 (plus 1, if you want, it doesn’t matter).

To help picture what’s going on, I made an interactive Sage notebook. If you have an account on sagenb.org, or have Sage installed on your own computer and want to just copy my code over, you can look at my code and play with the notebook. I had hoped that publishing my notebook there would let you play with the interactive bits without you needing an account, but no dice. Sorry.

To give some sense of my notebook, and the theorem, I’ve got some pictures for you.

First, let’s take \alpha=0.3826 or so (basically 1 minus the golden ratio, nice and irrational). I’ve set up my notebook so that points travel from 0, at the top of the circle, clockwise, because that’s how it was done in the papers I was reading, and I thought it’d be less confusing. So here’s the starting picture, when there’s just the points 0 and \alpha:

Along the outside of the circle, at each dot, I list which multiple it is. The “newest” dot is magenta, instead of red (probably not great color choices… mess with my code and make your own :)). In the center of the circle I list the lengths of the intervals, in decreasing order. Along each interval, I also write the age of that interval, and color-code the text to the list of distances. I’ve decided to always have the largest length be red-orangeish, the smallest length blue-ish, and the middle length (if there is one) green-ish.

In the picture above, the interval on the left is clearly the oldest of the longest length intervals, so the theorem is that when we add another point, this interval will get broken up by that point. Which is clearly true in this case.

Here’s another picture, using the same \alpha, but slightly more points, showing that three gaps occur:

And, finally, 20 points:

Here’s a picture using a starting \alpha a little bigger than 0.6, showing 20 points:

I like how the points seem to develop in clusters (also evidenced by Dr. Richeson’s app).

I guess that’s probably enough for now. Like I said, I’m hoping to have plenty more to say about things related to all of this soon…

Postscript: I want to make a few shout-outs. I thought putting them at the end of this post might interrupt any sort of flow of the article (if there is any) a little less.

mixedmath pointed out in the comments that public sagenb notebooks are currently (20130623) down. The code looks terrible in the comments, so I figured I’d just add it here:

defalpha = 0.38197 # golden ratio, ish
maxN = 20 # maximum number of points to put in the circle
tolerance = 10^(-7) # to decide when two floats are equal

# some colors, lower index corresponds to bigger distance
segcolor = [(0.86,0.28,0.06),(0.52,0.80,0.06),(0.19,0.11,0.60)]

# the unit circle
basepic = circle((0,0),1,rgbcolor=(0,0,0))

# floating part of a number
flpart = lambda v: v-int(v)

# a point v units along the circumpherence (of length 1 unit) at radius R
coords = lambda v,R: (R*sin(2*pi*v), R*cos(2*pi*v))

# draw dots on the circle, distinguish the "newest" by color
olddot = lambda v:circle(coords(v,1),0.02,rgbcolor=(1,0,0),fill=True)
newdot = lambda v:circle(coords(v,1),0.02,rgbcolor=(1,0,1),fill=True)

# storage
picturestore = {}

def addtopicturestore(alphaval):
    """ Make the picture for alphaval and all (up to maxN) numbers of points """
    picture = [basepic for m in range(0,maxN+1)] # to go into storage
    multiple = [flpart(m*alphaval) for m in range(0,maxN+1)] # the points
    
    # we care most about which distances are longest/shortest, and how
    # long each interval has been a certain distances
    # we'll build up these next few arrays as we increment the number of points
    # disttosucc[m] = actual distance to next point
    # agethisdist[m] = how long the interval after the point has been this distance
    # distsize[m] = 0,1,2 if the interval after point m is a big,med,or small interval
    disttosucc = [1] + [-1 for m in range(0,maxN)]
    agethisdist = [1] + [-1 for m in range(0,maxN)]
    distsize = [0] + [-1 for m in range(0,maxN)]
    
    # now, build up to having all of the points
    # currently, we suppose we only know the 0 point
    for N in xrange(2,maxN+1):
        
        newpt = multiple[N-1]
        
        # the new point breaks the oldest interval among those with biggest length
        # so find that interval
        oldestbigdist = distsize.index(0)
        for idx in xrange(oldestbigdist + 1, N-1):
            if distsize[idx] == 0 and agethisdist[idx] > agethisdist[oldestbigdist]:
                oldestbigdist = idx
        # newpt is the successor of oldestbigdist
        
        # update the only distances that change when adding this point
        splitdist = disttosucc[oldestbigdist]
        disttosucc[oldestbigdist] = newpt - multiple[oldestbigdist]
        disttosucc[N-1] = splitdist - disttosucc[oldestbigdist]
        
        # reset the age counts for these two new distances
        agethisdist[oldestbigdist] = 1
        agethisdist[N-1] = 1
        
        # now we recompute which distances are biggest/smallest
        
        # first, what are the 2 or 3 distances?
        distances = [disttosucc[oldestbigdist], 0, disttosucc[N-1]]
        if disttosucc[oldestbigdist] < disttosucc[N-1]:
            distances[0] = disttosucc[N-1]
            distances[2] = disttosucc[oldestbigdist]
        for idx in xrange(0,N):
            # we're using the fact that there are only 3 distances,
            # and that we already know two of them
            if disttosucc[idx] - distances[0] > tolerance:
                distances[1] = distances[0]
                distances[0] = disttosucc[idx]
            elif distances[0] - disttosucc[idx] > tolerance:
                if distances[2] - disttosucc[idx] > tolerance:
                    distances[1] = distances[2]
                    distances[2] = disttosucc[idx]
                elif disttosucc[idx] - distances[2] > tolerance:
                    distances[1] = disttosucc[idx]
            # while we're at it, update age of un-changed intervals
            if not idx == oldestbigdist and not idx == N-1:
                agethisdist[idx] += 1
                
        # now that we know the 2-3 distances, we can tell which points have which dist.
        for idx in xrange(0,N):
            smidx = 0
            while abs(distances[smidx]-disttosucc[idx]) > tolerance:
                smidx += 1
            distsize[idx] = smidx
            
        # finally, build the picture
        dots = [olddot(multiple[m]) for m in xrange(0,N-1)] + [newdot(multiple[N-1])]
        labels = [text(str(m),coords(multiple[m],1.1),rgbcolor=(0,0,1))
                  for m in xrange(0,N)]
        agelabels = [text(str(agethisdist[m]),
                           coords(multiple[m]+.5*disttosucc[m],.9),
                           rgbcolor=segcolor[distsize[m]])
                      for m in xrange(0,N)]
        distancelegend = text(str(distances[0]),(0,.1),rgbcolor=segcolor[0])
        distancelegend += text(str(distances[2]),(0,-.1),rgbcolor=segcolor[2])
        if distances[1]:
            distancelegend += text(str(distances[1]), (0,0), rgbcolor=segcolor[1])
        picture[N] += sum(dots)+sum(labels)+sum(agelabels)+distancelegend
        
    # outside the loop, all pictures have been computed, just store them
    picturestore[alphaval] = picture


# set up the interactive bits
@interact
def _( alpha=slider(0.0001,0.9999,0.0001,default=defalpha,label='Distance'),
       N=slider(2,maxN,1,default=2,label='Number of Points') ):
    if alpha not in picturestore:
        addtopicturestore(alpha)
    show(picturestore[alpha][N], axes=False, aspect_ratio = 1)

A Favorite Theorem

November 30, 2009

The other day I was hanging out with some math folk, and one asked the group what our favorite theorems are. A tough question, clearly, since there are soo many fantastic theorems to choose from. For me, there’s a place in my heart for slightly esoteric theorems. They should be surprising and seem to come from basically nowhere (unless you’ve been looking at the subject, perhaps). They should be fairly easy to state… possibly with a few minutes introduction of some easy definitions or results. And I seem to have taken up thinking that continued fractions are pretty cool. With that in mind, I present one of my favorite theorems:

Theorem: There is a constant, K, such that for almost all \alpha in (0,1), if the continued fraction representation of \alpha is [0;a_1,a_2,\ldots], then \sqrt[n]{a_1\cdots a_n}\to K as n\to\infty.

It turns out the constant, dubbed Khinchin’s constant, is K\approx 2.685, according to Wikipedia anyway. An exact expression for K is

\displaystyle K=\prod_{r=1}^{\infty}\left(1+\frac{1}{r(r+2)}\right)^{\log_2(r)}.

I’d like to try to give some very brief background on this theorem, along with the “few minutes introduction of some easy definitions and results”.

I should mention, first, that I take Khinchin’s book “Continued Fractions” as my primary reference. Hardy and Wright also have some chapters on continued fractions. And, of course, there’s Wikipedia.

Let’s begin with the process of constructing the continued fraction for an \alpha\in (0,1). I’ll take \alpha irrational, for convenience. Since 0<\alpha<1, we have 1<1/\alpha, and we let a_1 be the largest integer less than 1/\alpha, denoted \lfloor 1/\alpha\rfloor, and we know that a_1\geq 1. That leaves a little bit over, which I’ll denote z_1=(1/\alpha)-a_1\in (0,1). We have \alpha=1/(a_1+z_1), the first step in our continued fraction. You now iterate this process, looking at 1/z_1, setting a_2=\lfloor 1/z_1\rfloor and z_2=(1/z_1)-a_2, and obtaining

\displaystyle \alpha=\cfrac{1}{a_1+\cfrac{1}{a_2+z_2}}.

Since I picked \alpha irrational, this process keeps going – you keep getting a_n (all positive integers) and z_n (all in (0,1)). And then instead of writing huge nested fractions, you trim it down to \alpha=[0;a_1,a_2,\ldots]. That initial zero is just becuase I started with \alpha\in (0,1), you could instead let a_0=\lfloor \alpha'\rfloor, if you started with an \alpha' outside (0,1). I’ll continue assuming my values are in (0,1), so I’ll frequently just write [a_1,a_2,\ldots], dropping the initial zero.

This process gives us, for every \alpha\in(0,1), a sequence of values a_1,a_2,a_3,\ldots. Restated, we have functions a_n (and z_n as well) from (0,1) to positive integers. To start getting at Khinchin’s constant, you think about combinations of a_n^{-1}(k), for collections of ns and ks. That is, given n_1,n_2,\ldots,n_s and k_1,k_2,\ldots,k_s, all positive integers, what does the set of all \alpha=[a_1,a_2,\ldots] such that a_{n_i}=k_i for i=1,\ldots,s look like?

Let’s start with a_1, since it is the easiest. What does a_1^{-1}(k) look like? Well, a_1(\alpha)=k means that \lfloor 1/\alpha\rfloor=k. Which is to say, k\leq 1/\alpha<k+1, or 1/(k+1)<\alpha\leq 1/k. So the graph of a_1(x) is a step function, taking the value k on the interval (1/(k+1),1/k]. I picture this as a staircase descending as you move to the right. Of course, the stairs are pretty narrow on the left…

As an example, a_1^{-1}(3) is (1/4,1/3], the third stair from the right.

Now, on to a_2. What does a_2^{-1}(k) look like? Well, in finding the continued fraction for \alpha, we find first \alpha=1/(a_1+z_1), and if a_2(\alpha)=k then \lfloor 1/z_1\rfloor =k, or 1/(k+1)<z_1<1/k. So we find that for any a_1, all values between 1/(a_1+1/k) and 1/(a_1+1/(k+1)) have a_2=k. That is, in each of the “stairs” from the discussion of a_1, the graph of a_2 has another staircase, this time ascending as you move from left to right.

For example, a_2^{-1}(2) comprises, in the ascending staircases in each interval (1/(k+1),1/k], the second stair from the left. It is an infinite collection of intervals. If you also ask about those where a_1 is some constant, you pick out a single subinterval.

So we’ve got at least a little idea about how these functions a_n work. Given n_1,\ldots,n_s and k_1,\ldots,k_s as above, the set of all \alpha with a_{n_i}=k_i will be an infinite collection of subintervals of (0,1). Next you could ask, what is the measure (sum of lengths of intervals) of this set? I’ll mostly talk about the case when s=1, n_1=n, and k_1=k. I’ll denote the set of appropriate \alpha by E\binom{n}{k}, and the measure of that set by \mathcal{M}\binom{n}{k}.

Apparently Gauss had a go at questions like this. He defined m_n(x) to be the measure of the set of all \alpha such that z_n(\alpha)<x, and found that

\displaystyle \lim_{n\to\infty} m_n(x)=\frac{\ln(1+x)}{\ln 2}.

Crazy (brilliant) old Gauss (who apparently had an awesome signature). Khinchin suggests that Gauss probably arrived at this result based on a recurrence relation (functional equation, if you’d rather) among the m_i. I’ll let you read about it in Khinchin’s book. Anyway, in 1928, Kuz’min came along and found a generalization. Kuz’min’s theorem gives a more precise result than Gauss’ (at least, as stated above). Namely:

Theorem: There are positive constants, A and \lambda, such that

\displaystyle \left|m_n'(x)-\frac{1}{(1+x)\ln 2}\right|<Ae^{-\lambda\sqrt{n}}.

On integrating and taking limits, we obtain Gauss’ result. Alternatively, using the relation

\displaystyle \mathcal{M}\binom{n}{k}=m_{n-1}(\tfrac{1}{k})-m_{n-1}(\tfrac{1}{k+1})=\int_{1/(k+1)}^{1/k}m_{n-1}'(x)\ dx,

one can also obtain

\displaystyle \left|\mathcal{M}\binom{n}{k}-\log_2\left(1+\frac{1}{k(k+2)}\right)\right| < \frac{A}{k(k+1)}e^{-\lambda\sqrt{n-1}}.

Or, taking limits,

\displaystyle \lim_{n\to\infty} \mathcal{M}\binom{n}{k}=\log_2\left(1+\frac{1}{k(k+2)}\right).

This, by itself, seems pretty interesting. I find it a hard limit to interpret. After all, E\binom{n}{k} and E\binom{n+1}{k} have nothing to do with eachother, so the measure that we are calculating is of a completely different set with each n. If, instead, we were taking measures of a sequence of nested subsets, or something, I feel like I could get a better handle on this theorem. But we’re not. If anybody has a nice interpretation of this limit, I’d love to hear it.

Anyway, it’s about time for Khinchin’s theorem:

Theorem: If f is a function from positive integers to positive reals such that f(r)<C\cdot r^{\frac{1}{2}-\delta} for some positive constants C and \delta, then for almost every \alpha=[a_1,a_2,\ldots] in (0,1),

\displaystyle \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{\infty}f(a_k)=\sum_{r=1}^{\infty}f(r)\log_2\left(1+\frac{1}{r(r+2)}\right).

Khinchin’s constant is obtained from the case when f(r)=\ln(r), simply messing with the equality using log and exponent rules.

Another interesting case is when you take f(r)=1 if r=\ell and 0 otherwise. Then the limit on the left of the equality in Khinchin’s theorem should be interpreted as the density of \ell among the values of the a_n in the continued fraction for \alpha. The expression on the right in the equality simplifies to a single term, instead of an infinite series. For example, with \ell=1 we find that approximately 42% of the terms (Khinchin calls them “elements”) in the continued fraction for just about any \alpha you pick will be 1. Khinchin summarizes the result:

“Thus, an arbitrary natural number occurs as an element in the expansion of almost all numbers with equal average frequency.”

Pretty awesome. Pick a positive integer. Pick any \alpha you want (pick several!). Find all of the elements in the continued fraction for \alpha, and the percentage that are your chosen integer (it will be a positive percentage!). You’ll get the same percentage for all of the \alpha you pick. Unless, of course, you have terrible luck (or are malicious) in picking \alpha. When you’re done with all of that, you might check, as a fun little algebra exercise, that the sum of all of the expected densities, over all choices of \ell, is 1, as it should be.

There are a few other interesting results like this. For some reason I have a harder time remembering them, and so they didn’t get to be my favorite theorem. But they’re just as cool, really.

For the first, I should remind you that the n-th convergent for a continued fraction [a_1,\ldots] is the fraction [a_1,\ldots,a_n]. As this is a finite continued fraction, it represents a rational number, and it is typically denoted p_n/q_n (relatively prime). The growth of the denominators in the sequence of convergents in a continued fraction is a fairly worthwhile question. Using some elementary relationships among the a_i and q_i, you can find that a_1\cdots a_n<q_n<2^na_1\cdots a_n. The interesting theorem I have in mind is another “there is a constant” theorem:

Theorem: For almost all \alpha=[a_1,\ldots],

\displaystyle \lim_{n\to \infty}\sqrt[n]{q_n}=e^{\pi^2/(12\ln 2)}.

This constant gets called the Lévy-Khinchin constant on Wikipedia.

Another interesting result, which I stumbled across while preparing a talk about some of these things, is called Lochs’ theorem. I know even less about this than the previous theorems (it isn’t in Khinchin’s book), but apparently it basically says that each time you find a new a_n, in the process of finding the continued fraction for \alpha, the convergent has one better decimal place of accuracy than the previous convergent. So continued fractions are just as expressive as decimals. They just don’t add together quite so easily 🙂

So anyway, what’s your favorite theorem?

Finale!

November 27, 2009

I think I’m about ready to tie together Farey sequences and the Riemann zeta function (specifically, the Riemann hypothesis). I know there were lots of gaps along the way, and lots more to explore from here, so that all gets to be the dénouement, if/when I get to it.

Yesterday we ended with the identity

\displaystyle \frac{1}{\zeta(s)}=\sum_{n}\frac{\mu(n)}{n^s}.

Define M(x)=\sum_{n\leq x}\mu(n), which you might call the cumulative Möbius function. If you are comfortable with your measure theory, you can decide that there is a measure, which I’ll denote dM, such that \int_a^b dM=M(b)-M(a). And if you’re ok with that statement, you probably also don’t mind writing the identity for the reciprocal of zeta, above, as

\displaystyle \frac{1}{\zeta(s)}=\int_0^{\infty} x^{-s}\ dM.

If you then integrate by parts you find that

\displaystyle \frac{1}{\zeta(s)}=s\int_0^{\infty}x^{-s-1}M(x)\ dx.

Now if M(x)\leq x^a (we could say something more like M(x)=O(x^a)) for some positive real value a, then this most recent integral will converge for \text{Re }s>a. Which means that \zeta(s) would have no zeroes in the half-plane \text{Re }s>a. Since there are zeroes on the critical line, where \text{Re }s=1/2, we know that no value of a less than 1/2 will work. But if you can show that M(x)\leq x^{(1/2)+\epsilon} for every positive \epsilon, then you’ll have proven the Riemann hypothesis (and earned yourself a cool million).

Ok, so now we just have to relate the Farey sequences to this M(x) somehow.

It’s been a while, so perhaps we should recall some basics about the Farey sequences. The n-th Farey sequence, denoted F_n consists of the reduced fractions p/q in [0,1] with q\leq n, ordered as usual. For today, let’s actually remove 0 from the sequences. Thinking about how many terms are in F_n, we quickly determine that there are \sum_{i\leq n}\phi(i), where \phi(i) is the number of positive integers less than i that are relatively prime to i (with \phi(1)=1). Let me call this sum \Phi(n), so that F_n has \Phi(n) terms.

To keep my notation under control, fix an n. Then for 1\leq v\leq \Phi(n), let r_v be the v-th term in the Farey sequence F_n. The terms in F_n are not equally spaced in [0,1], so let us also consider the sequence of \Phi(n) equidistant points in [0,1]. These will be the values s_v=v/\Phi(n). We’re interested in how far the Farey sequence is from this equidistant sequence, so let’s set \delta_v=r_v-s_v.

If f:[0,1]\to \mathbb{R}, then you can show (essentially by Möbius inversion) that

\displaystyle \sum_{v=1}^{\Phi(n)}f(r_v)=\sum_{k=1}^{\infty}\sum_{j=1}^{k} f(j/k)M(n/k).

The idea is that the function D(m) that is 1 for m>1 and 0 for m<1 is also

D(m)=\sum_n M(m/n),

because you can write (fairly cleverly, I feel)

M(m)=\sum_n \mu(n)D(m/n).

By the way, I’m being a bit loose with my values of these stepwise things when they actually make their steps. Apparently the convention is to define the value at the change to be the midpoint of the values on either side. I feel like it’s a technicality I’m not hugely interested in right now.

Now we apply this identity to f(u)=e^{2\pi iu}, obtaining

\displaystyle \sum_v e^{2\pi ir_v}=\sum_k M(n/k)\sum_j f(j/k).

That inner sum on the right-hand side is the sum of k unit vectors equally distributed around the unit circle, and so is 0 except when k=1. So we obtain

M(n)=\sum_v e^{2\pi ir_v}.

Finally, replace r_v with s_v+\delta_v. After shuffling some symbols around, you can write

M(n)=\sum_v e^{2\pi s_v}(e^{2\pi i\delta_v}-1)+\sum_v e^{2\pi is_v}.

Taking absolute values on each side, using the triangle inequality, and the identity we already used about sums of equally spaced unit vectors, we can write

\begin{array}{l}|M(n)|\leq \displaystyle \sum_v \left|e^{2\pi i\delta_v}-1\right| =\sum_v \left| e^{\pi i\delta_v}-e^{-\pi i\delta_v}\right|=2\sum_v \left|\sin \pi\delta_v\right|\\ \displaystyle \qquad \leq 2\pi \sum_v \left|\delta_v\right|.\end{array}

Using our lemma from earlier, about how to obtain the Riemann hypothesis from M, we have, at last:

If \sum_v |\delta_v| is O(n^{(1/2)+\epsilon}) for every \epsilon>0, then the Riemann hypothesis is true.

Hurray! The result that encouraged me to wander down this path!

So, that was fun. It turns out that both of today’s results that imply the Riemann hypothesis are actually equivalent to the Riemann hypothesis, but perhaps that’s a topic for another day…

[Update 20091203: Noticed I had some absolute value bars in the wrong places in the line before the final theorem, and corrected them (hopefully)]

Möbius Inversion

November 25, 2009

Yesterday I mentioned Möbius inversion, and I decided that it was worth looking at a little bit more. First of all, it’s pretty fun. Secondly, it’ll give us a useful identity concerning \zeta(s).

Before I get too far, I should add another reference to my growing list. With Edwards’ book, and Hardy and Wright, include Apostol’s “Introduction to Analytic Number Theory.”

Anyway, yesterday I defined the Möbius function

\displaystyle \mu(n)=\begin{cases}1 & n=1 \\ 0 & n\text{ not squarefree} \\ (-1)^k & n=p_1\cdots p_k\text{ distinct primes}\end{cases}

and stated Möbius inversion as follows:

Thm: If f(n)=\sum\limits_{d|n}g(d) then g(n)=\sum\limits_{d|n}\mu(d)f(n).

Let’s begin today with a proof of this statement. Pf: Suppose f(n) is as stated, and consider \sum\limits_{d|n}\mu(d)f(n). By assumption on f, this is \sum\limits_{d|n}\mu(d)\sum\limits_{d'|(n/d)} g(d') which I’ll write as \sum\limits_{dd'|n}\mu(d)g(d'). Now in this sum, what is the coefficient of g(m)? Well, it is \sum\limits_{dm|n}\mu(d)=\sum\limits_{d|(n/m)}\mu(d). We saw yesterday that this sum is 0 unless n/m=1, i.e., unless m=n. So \sum\limits_{d|n}\mu(d)f(n)=g(n), as claimed.

That’s pretty fun, I think.

There’s another way to go about these things, and it is by way of what are called Dirichlet series (according to Wikipedia, Dirichlet had many names). Suppose f is a function defined on the positive integers (such a function seems to sometimes get called an arithmetic function). I’ll say that the Dirichlet series for f is the series F(s)=\sum_n f(n)n^{-s}. I’ve written it this way, so that I have a function defined whenever the series converges, but I’ll pretty much ignore convergence aspects in what follows.

Suppose F(s) and G(s) are the Dirichlet series associated to some f,g, respectively. What does the product of these two series look like? Well, since f(n)n^{-s}\cdot g(m)m^{-s}=f(n)g(m)(nm)^{-s}, you can determine that the product of these two Dirichlet series is another Dirichlet series, this one associated to the function c(n)=\sum\limits_{d|n}f(d)g(n/d). Indeed, just ask yourself, “what is the coefficient of n^{-s} in the product of the series F(s) and G(s)?”

By the way, I’ve called the product series c because it also gets called the convolution of f and g. The convolution of f and g is denoted f*g, and is simply the function (f*g)(n)=\sum\limits_{d|n}f(d)g(n/d).

Now you’ve got an operation, *, on arithmetic functions, and you can start asking what properties it has. It turns out that * gives the set of arithmetic functions (well, those functions that aren’t 0 on the input 1) the structure of a commutative group. The identity is the function I with I(1)=1 and I(n)=0 if n>1 (corresponding to the identity Dirichlet series, the constant function 1). It’s sort of a fun little induction problem to find a formula for the inverse of a given arithmetic series. We won’t need the formula exactly, so perhaps I’ll leave it for you.

Let z(n)=1, so that the associated Dirichlet series is \zeta(s).

We now have another language in which to state Möbius inversion:

Thm: If f=g*z then g=\mu*f.

In this statement, let g=I (the identity of convolution) so that f=z. Then we obtain I=\mu *z. On taking Dirichlet series, we find that

\displaystyle 1=\left(\sum_n \frac{\mu(n)}{n^s}\right)\zeta(s),

or, in other words,

\displaystyle \frac{1}{\zeta(s)}=\sum_n \frac{\mu(n)}{n^s}.

How’s that for an identity?

There’s another way to get at this identity, without the language of Dirichlet series (explicitly anyway), which I wrote “holy crap” next to in the book (Hardy and Wright) when I read it. So probably I should share. Using Euler’s product (recall the indexing is over primes)

\displaystyle \zeta(s)=\prod_p (1-p^{-s})^{-1}

we have

\displaystyle \frac{1}{\zeta(s)}=\prod_p (1-p^{-s}).

Here’s the part that rocked me: Look at 1-p^{-s} and write it as 1-p^{-s}+0p^{-2s}+0p^{-3s}+\cdots, i.e. as \sum_k \mu(p^k)p^{-ks}. Because the first thing I think to do when I see a binomial is to write it as an infinite series of mostly 0s. Like I said yesterday, I guess it helps if you know where you are going. So anyway, we’ve got

\displaystyle \frac{1}{\zeta(s)}=\prod_p \sum_k \frac{\mu(p^k)}{(p^k)^s}.

Now, like we’ve been doing, let’s ask, “what is the coefficient of n^{-s} when the product is all written out?” Well, the only way to get n^{-s} is the term corresponding to the prime factorization of n. If n=p_1^{e_1}\cdots p_k^{e_k}, then the coefficient is \mu(p_1^{e_1})\cdots \mu(p_k^{e_k}). But \mu is “multiplicative,” meaning that if a and b are relatively prime then \mu(a)\mu(b)=\mu(ab). So the coefficient of n^{-s} is just \mu(p_1^{e_1}\cdots p_k^{e_k})=\mu(n). This proves the identity.

That’s the identity we’ll use in the next post or two to finally tie together Farey sequences and the zeta function, believe it or not. So I’ll basically stop here. But I do want to mention that you can do some of the things we’ve done above in slightly more general contexts, and I remember really enjoying reading about them. Namely, you can work in other posets. In what we’ve been doing above, there has, somewhere in the background, been the poset of positive integers ordered by n\leq m iff n|m. Probably a good place to start reading about the generalization is Wikipedia, as usual. If I find, sometime, the paper I read where I first saw these things, I’ll update this post.

In the mean time…

A Formula for Primes

November 24, 2009

I wanted to write this post yesterday, but I couldn’t work out some of the steps that I wanted to be able to work out, so I had to wait a day and try again. But I think I’ve got it now, let’s see…

We’ve been working with the function

\displaystyle J(x)=\sum_{p^n\leq x}\frac{1}{n}.

Last time, we arrived (skipping over some rather large gaps) at an analytic formula for J(x), namely:

\begin{array}{l}\displaystyle J(x)=Li(x)-\sum_{\text{Im }\rho>0}\left(Li(x^{\rho})+Li(x^{1-\rho})\right)\\ \displaystyle\qquad\qquad +\int_x^{\infty}\frac{dt}{t(t^2-1)\ln t}-\ln 2,\end{array}

where (mod issues)

\displaystyle Li(x)=\int_0^x \frac{dt}{\ln t}.

The goal today is to use this formula for J(x) to find a formula for primes. In particular, we will find a formula for \pi(x), which is the number of primes no larger than x. These two functions are related, as follows: recall that J(x) is a step function which jumps by 1/n at every prime power p^n. Pick an x_0. How many jumps of size 1 does J(x) make before getting to x_0? Well, it jumps by 1 at the primes, so it makes \pi(x_0) jumps of size 1. How many jumps of size 1/2? These are jumps at prime squares, p^2\leq x_0. This inequality is the same as p\leq x_0^{1/2}, so we make \pi(x_0^{1/2}) jumps of size 1/2. Similarly, we’ll make \pi(x_0^{1/n}) jumps of size 1/n to calculate J(x_0). Eventually (for some large enough N) x_0^{1/N} will be less than 2, and so \pi(x_0^{1/N})=0, and we can stop looking for jumps.

Translating the above into symbols, we have just decided that

J(x)=\pi(x)+\frac{1}{2}\pi(x^{1/2})+\frac{1}{3}\pi(x^{1/3})+\cdots+\frac{1}{n}\pi(x^{1/n})+\cdots.

I’ve written it as an infinite sum, but eventually (like I said) those \pi values will be 0, so we’ve really got a finite sum.

Now wait, our goal was a formula for \pi(x), in terms of J(x), and I’ve got a formula the other way, J(x) in terms of \pi(x).

This is the part I got stuck on yesterday. Edwards’ book says (or perhaps doesn’t, causing my problems) that basically you do Möbius inversion to the sum above, and obtain a formula for \pi(x) in terms of J(x). He says slightly more than that, but I wasn’t able to piece it together as easily as he seemed to indicate it should work. But anyway, what’s this Möbius inversion thing? I’ll take a pretty low-level account, leaving lots of room for things to be jazzed up (which I may return to and do sometime soonish).

An integer is said to be “squarefree” if it factors into the product of distinct primes or, equivalently, it is not divisible by a square. Define the following function, called the Möbius function, on positive integers (p_i is supposed to indicate a prime):

\displaystyle \mu(n) = \begin{cases}1& n=1 \\ 0 & n\text{ not squarefree} \\ (-1)^k & n=p_1\cdots p_k\text{ squarefree}\end{cases}

So, for example, \mu is -1 at any prime, \mu(6)=\mu(10)=1, and \mu(8)=\mu(12)=0.

Möbius inversion is then the following:

If f(n)=\sum_{d|n}g(d) then g(d)=\sum_{d|n}\mu(d)f(n/d), and conversely.

Well, I tried to figure out how my expression for J(x), in terms of \pi(x^{1/n})s was such an f of g. I don’t see it, as stated. Perhaps somebody can point me in the right direction in the comments.

The observation you are supposed to make comes when you consider something like \frac{1}{2}J(x^{1/2}). Using our formula above, we can write

\frac{1}{2}J(x^{1/2})=\frac{1}{2}\pi(x^{1/2})+\frac{1}{4}\pi(x^{1/4})+\frac{1}{6}\pi(x^{1/6})+\cdots

or, more generally, we have

\displaystyle \frac{1}{n}J(x^{1/n})=\sum_k \frac{1}{nk}\pi(x^{1/nk}).

If we subtract this equation (with n=2) from the equation for J(x) in terms of \frac{1}{n}\pi(x^{1/n})s, we can eliminate one term from the right-hand side. The idea is to iterate this process. I think a (reasonably, by my blog’s standards) clean way to write what is happening is as follows (it helps when you know the answer you are supposed to get :)):

Consider the sum

\displaystyle \sum_n \frac{\mu(n)}{n}J(x^{1/n}),

which by our most recent formula is the same as

\displaystyle \sum_{n,k}\frac{\mu(n)}{nk}\pi(x^{1/nk}).

Now, fix a positive integer m. The coefficient of \pi(x^{1/m}) in this last double sum is

\displaystyle \sum_{nk=m}\frac{\mu(n)}{m}=\frac{1}{m}\sum_{n|m}\mu(n).

Almost there… what is \sum_{n|m}\mu(n)? Turns out, it is 1 if m=1 (which is clear), and 0 otherwise. Indeed, suppose m has prime factorization p_1^{e_1}\cdots p_k^{e_k}. Then all of the divisors, n in the sum we’re after, are of the form p_1^{f_1}\cdots p_k^{f_k} where each 0\leq f_i\leq e_i. The only divisors, n, where \mu(n)\neq 0 are those where the f_i are all either 0 or 1. So the worthwhile n could be 1, or some p_i, or a p_ip_j (i\neq j), or p_ip_jp_{\ell} (i\neq j, i\neq \ell, j\neq \ell), etc. So

\begin{array}{l}\displaystyle \sum_{n|m}\mu(n)=1+\sum_i \mu(p_i)+\sum_{i,j} \mu(p_ip_j)+\cdots+\mu(p_1\cdots p_k) \\ \qquad \qquad \displaystyle =1-k+\binom{k}{2}-\binom{k}{3}+\cdots+(-1)^k=(1-1)^k=0.\end{array}

(By the way, this is (almost verbatim) the proof in Hardy and Wright’s “An Introduction to the Theory of Numbers”, and I’ll count that as one of my references for anything I say about \mu).

So, let’s gather back up here. We’ve now shown that in the sum

\displaystyle \sum_n \frac{\mu(n)}{n}J(x^{1/n})=\sum_{n,k}\frac{\mu(n)}{nk}\pi(x^{1/nk})

the coefficient of \pi(x^{1/m}) is 0 unless m=1, when the coefficient is 1. Therefore, we have obtained out formula for \pi(x):

\begin{array}{l} \pi(x)=\displaystyle \sum_n \frac{\mu(n)}{n}J(x^{1/n})\\ \qquad \displaystyle=J(x)-\frac{1}{2}J(x^{1/2})-\frac{1}{3}J(x^{1/3})-\frac{1}{5}J(x^{1/5})+\frac{1}{6}J(x^{1/6})-\cdots\end{array}

I guess the point of Riemann’s paper was that the terrible integral formula he obtained for J(x) could then be used in this expression for \pi(x), to give an exact analytic expression (that is, one only an analyst would find pretty) for \pi(x). Pretty wild, in my mind.

It seems that if you get some feeling for the terms in J(x), you can decide that the “dominant” term is the Li(x) term. And then that still ends up as the dominant term in the expression above for \pi(x). And so you get to say

\pi(x)\sim Li(x).

So you’ve got that going for you, which is nice.

Another Formula for J(x)

November 22, 2009

Yesterday, I related the logarithm of \zeta(s) to a piecewise linear function J(x). You may recall that J(x) was defined for positive reals by setting it equal to 0 at x=0, and then jumping by 1/n whenever x=p^n, for some prime p and integer n. At the end of the day, we got to

\displaystyle J(x)=\frac{1}{2\pi i}\int_{a-i\infty}^{a+i\infty}\ln \zeta(s) x^s\frac{ds}{s}

where a=\text{Re}(s)>1. Today, we’ll analyze \ln \zeta(s) some more, and re-write the formula above.

When I introduced \zeta(s), I ended with the following formula:

\displaystyle \zeta(s)=\frac{\Gamma(1-s)}{2\pi i}\int_{+\infty}^{+\infty} \frac{(-x)^s}{e^x-1} \frac{dx}{x}

where the bounds on that integral are supposed to represent a curve that “starts” at the right-hand “end” of the real line, loops around 0, and then goes back out the positive axis to infinity. I’m not good enough at complex line integrals at this point to say any more about this. But apparently if you are good at these sorts of integrals, using Cauchy’s integral formula and things you can find the so-called “functional equation”

\zeta(s)=\Gamma(1-s)(2\pi)^{s-1}2\sin(s\pi/2)\zeta(1-s).

If you then use the relation I mentioned previously:

\Gamma(s)\Gamma(1-s)\sin(\pi s)=\pi

(well, you use this for s=s/2), and one I haven’t mentioned:

\Gamma(1-s)=2^{-s}\Gamma(1-\frac{s}{2})\Gamma(\frac{1-s}{2})\pi^{-1/2}

and move some symbols around, you arrive at a more symmetric equation:

\Gamma(\frac{s}{2})\pi^{-s/2}\zeta(s)=\Gamma(\frac{1-s}{2})\pi^{-(1-s)/2}\zeta(1-s).

Notice that if you plug 1-s in the formula on the left-hand side, you obtain the right-hand side.

This function on the left-hand side apparently has poles at 0 and 1, so if we define

\xi(s)=\frac{s(s-1)}{2}\Gamma(\frac{s}{2})\pi^{-s}{2}\zeta(s)

then we obtain an entire analytic function satisfying \xi(s)=\xi(1-s). Using the factorial relation for \Gamma, we can re-write \xi(s) as

\xi(s)=(s-1)\Gamma(\frac{s}{2}+1)\pi^{-s/2}\zeta(s).

I get the impression that if you know what you are doing, then the things above aren’t tooo hard to justify. Apparently the next part is a bit trickier. Apparently, you can write

\xi(s)=\xi(0)\prod_{\rho}(1-\frac{s}{\rho}),

where the product is indexed over the roots, \rho, of \xi (so \xi(\rho)=0).

If you’ve heard anything about the Riemann hypothesis, you know that the roots (the “non-trivial” ones, I didn’t talk about the trivial ones) of \zeta(s) are a big deal. Our second formula for \xi(s) shows that they are (basically) the same as the roots of \xi(s), and so they are the \rho that the sum above is indexed over. The symmetric equation from earlier has a little something to say about the zeroes, and it has been shown that all of the zeroes have real part bigger than 0 and less than 1 (this is called the “critical strip”). The hypothesis (whose truth won’t affect what we’re saying below) is that all of the zeroes have real part 1/2 (this is the “critical line”). Apparently Riemann didn’t need this hypothesis for the things in his paper that introduced it, so I don’t really have much more to say about it right now. Although, honestly, I still don’t see what all the fuss is about 🙂 The formulas we’ll get below and tomorrow work even if the roots aren’t on the critical line (unless I’m missing something important. If I am, please comment).

Anyway, back to the topic at hand. Let me try to convince you that it isn’t horribly unreasonable to think about writing a function as a product over its roots, as I’ve done above. For the sake of example, let f(x)=3x^3+3x^2-30x+24 (or pick your own favorite polynomial). The usual way this would get factored, in all the classes I’ve ever taken or taught, is (up to permutation) f(x)=3(x+4)(x-1)(x-2), showing that the roots are x=1,2,-4. However, if you factor a 4 out of the x+4 term, and -1 and -2 out of the other terms, you can also write f(x)=24(1-\frac{x}{-4})(1-x)(1-\frac{x}{2}). You still see all the zeroes when you write the polynomial this way. You can also see that the coefficient in the front is f(0). So we’ve written f(x)=f(0)\prod_{\rho}(1-x/\rho), which is the same goal as what we’re doing with \xi above. Incidentally, the idea of writing a function this way was also used by Euler to establish \zeta(2)=\sum 1/n^2=\pi^2/6 (I’ve mentioned this briefly elsewhere).

We now have two formulas for \xi(s), so we can put them together to get

\xi(0)\prod_{\rho}(1-\frac{s}{\rho})=(s-1)\Gamma(\frac{s}{2}+1)\pi^{-s/2}\zeta(s).

Recalling that our formula for J(x), at the beginning, involved \ln\zeta(s), let’s take the log of the equation above and solve for the \zeta term:

\begin{array}{l} \ln \zeta(s)=\ln \xi(0)+\sum_{\rho}\ln(1-\frac{s}{\rho})\\ \qquad\qquad -\ln \Gamma(\frac{s}{2}+1)+\frac{s}{2}\ln \pi-\ln(s-1).\end{array}

The idea is now to plug this in the formula for J(x). Apparently if you do, though, you’ll have some issues with convergence. So actually try to do the integral in J(x), using integration by parts (hint: dv=x^s ds). The “uv” term goes to 0 and you obtain

\displaystyle J(x)=\frac{-1}{2\pi i}\cdot \frac{1}{\ln x}\int_{a-i\infty}^{a+i\infty}\frac{d}{dx}\left[\frac{\ln\zeta(s)}{s}\right]x^s\ ds,

where, as before a=\text{Re } s. Now plug in the 5 terms we’ve got above for \ln \zeta(s), and you get a formula for J(x). What happens to the terms? Can you actually work out any of the integrals?

Well, you might be able to. I’m not. Not right now anyway. But I can tell you about what others have figured out (rather like I’ve been doing all along, in fact)…

It’s clear that the \frac{s}{2}\ln \pi term drops out, because you divide by s and then take the derivative of a constant and just get 0. The term with \ln\xi(0) ends up just giving you \ln\xi(0), which is -\ln(2).

The term corresponding to the term with \Gamma in it can be rewritten as

\displaystyle \int_x^{\infty}\frac{dt}{t(t^2-1)\ln t}

(as if that were helpful).

The important terms seem to involve the function

\displaystyle Li(x)=\int_0^x \frac{dt}{\ln t}.

Of course, this integrand has a bit of an asymptote at 1, so really Li(x) (in Edwards’ book, anyway) is the “Cauchy principal value” of this integral, namely

\displaystyle Li(x)=\lim_{\epsilon\to 0^+}\int_0^{1-\epsilon}\frac{dt}{\ln t}+\int_{1+\epsilon}^x \frac{dt}{\ln t}.

This function is, rather famously, related to approximating the number of primes less than a given bound. In fact, tomorrow I plan on having more to say about this. But back to the terms in our integral for J(x).

The term corresponding to the sum over the roots ends up giving you

\displaystyle -\sum_{\text{Im }\rho>0}Li(x^{\rho})+Li(x^{1-\rho}).

But apparently the dominant term is the term corresponding to \ln (s-1). It actually gives you Li(x)

So, finally, we have written

\begin{array}{l}\displaystyle J(x)=Li(x)-\sum_{\text{Im }\rho>0}\left(Li(x^{\rho})+Li(x^{1-\rho})\right)\\ \displaystyle\qquad\qquad +\int_x^{\infty}\frac{dt}{t(t^2-1)\ln t}-\ln 2.\end{array}

Doesn’t that make you feel better? We started with the reasonably understandable

\displaystyle J(x)=\sum_{p^n\leq x}\frac{1}{n},

and created the monstrosity above. I guess this is why I’m not an analyst. To me, it seems worse to write J(x) as this terrible combination of lots of integrals. But apparently it’s useful in analysis to have such formulas. I guess we’ll see a use tomorrow…