Franklin's blog

Exploring finite central groupoids

2026-01-01

A few weeks ago I was craving a math puzzle but feeling uninspired, so I turned to the Equational Theories project for ideas. The project (which has now come to an end) aimed to prove as many implications as possible between algebraic identities of binary operations by crowdsourcing proofs written in the Lean programming language. In its completed state, the project is not only an incredible accomplishment but also a great source of mathematical chestnuts.

I settled on spending some time exploring central groupoids, which are defined as magmas (i.e. sets equipped with a binary operation $\cdot$) satisfying the following law: $(x\cdot y)\cdot (y\cdot z) = y$ For intuition, there's a whole family of easy-to-comprehend central groupoids that can be defined on sets of ordered pairs $A^2 = A\times A$. On a set of this form, the operation defined by $(x_1,x_2)\cdot (y_1,y_2) = (x_2, y_1)$ automatically satisfies the central groupoid law. Central groupoids of this form are called natural central groupoids. Part of the intrigue of central groupoids is that it's kind of tricky to exhibit families of non-natural central groupoids that are as intuitive as the natural ones. Even so, non-natural central groupoids have several properties that are loosely analogous to those of the natural ones: for instance, all finite central groupoids have a perfect square number of elements (which we will prove later).
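As a quick sanity check, the central groupoid law can be verified by brute force on a small natural central groupoid. This is only a sketch in Python; the helper names `natural_op` and `satisfies_central_law` are made up for illustration.

```python
from itertools import product

def natural_op(p, q):
    """Natural operation on ordered pairs: (x1, x2) . (y1, y2) = (x2, y1)."""
    return (p[1], q[0])

def satisfies_central_law(elements, op):
    """Brute-force check of (x . y) . (y . z) == y over all triples."""
    return all(op(op(x, y), op(y, z)) == y
               for x, y, z in product(elements, repeat=3))

# The natural central groupoid on A x A with A = {0, 1, 2} has 9 elements.
pairs = list(product(range(3), repeat=2))
law_holds = satisfies_central_law(pairs, natural_op)
print(len(pairs), law_holds)
```

The same `satisfies_central_law` check works for any finite operation given as a Python function, which makes it handy for testing candidate non-natural examples.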

Digraph, matrix and powerset representations

Central groupoids can also be understood as a special kind of directed graph. If $D$ is a digraph with the property that for any two vertices $v,w\in V(D)$ (not necessarily distinct) there is a unique directed path of length 2 from $v$ to $w$, then a binary operation $\cdot$ can be defined on $V(D)$ such that $v\cdot w$ is the intermediate vertex on that unique path, that is, the unique vertex such that $v\to v\cdot w \to w$ in the graph $D$. It follows that this binary operation satisfies the central groupoid law: if $x,y,z\in V(D)$, then $x\cdot y\to y$ and $y\to y\cdot z$ are both edges, meaning that $y$ is intermediate on a path of length 2 from $x\cdot y$ to $y\cdot z$.

Conversely, every central groupoid arises from a digraph in this way: given a central groupoid $C$, we can construct a digraph $D$ giving rise to it. Simply let the vertices $V(D)$ consist of the elements of $C$, and for each vertex $v\in V(D)$, let there be an edge $v\to v\cdot w$ for each $w\in V(D)$. This graph is guaranteed to have the uniqueness property of length-2 paths. A path $v\to v\cdot w\to w$ always exists, because $(v\cdot w)\cdot (w\cdot w) = w$ shows that $v\cdot w\to w$ is an edge. And if $v\to x\to w$ for some $v,w,x\in V(D)$, then we must have that $x = v\cdot v'$ and $w = x\cdot w'$ for some $v'$ and $w'$, implying that $(v\cdot v)\cdot x = (v\cdot v)\cdot (v\cdot v') = v$ and hence $v\cdot w = ((v\cdot v)\cdot x)\cdot (x\cdot w') = x$, meaning that all intermediate elements on a length-2 path from $v$ to $w$ must equal $v\cdot w$.
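The two constructions can be sketched in a few lines of Python. The function names `op_from_digraph` and `digraph_from_op` are hypothetical, and the example digraph is simply the one belonging to the natural central groupoid of order 4.

```python
from itertools import product

def op_from_digraph(vertices, edges):
    """Return {(v, w): midpoint of the unique path v -> x -> w}.

    Raises ValueError if some pair does not have exactly one such path.
    """
    succ = {v: {w for (u, w) in edges if u == v} for v in vertices}
    table = {}
    for v, w in product(vertices, repeat=2):
        mids = [x for x in succ[v] if w in succ[x]]
        if len(mids) != 1:
            raise ValueError(f"{len(mids)} paths of length 2 from {v} to {w}")
        table[(v, w)] = mids[0]
    return table

def digraph_from_op(vertices, table):
    """Edges v -> v . w for every pair (v, w)."""
    return {(v, table[(v, w)]) for v, w in product(vertices, repeat=2)}

# Digraph of the natural central groupoid on pairs over {0, 1}:
# there is an edge (a, b) -> (b, c) for all a, b, c.
vertices = list(product(range(2), repeat=2))
edges = {((a, b), (b, c)) for a, b, c in product(range(2), repeat=3)}

table = op_from_digraph(vertices, edges)
round_trip = digraph_from_op(vertices, table)
print(table[((0, 1), (1, 0))], round_trip == edges)
```

Going from the digraph to the operation and back recovers the original edge set, which is the correspondence described above.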

Digraphs $D$ can also be represented by adjacency matrices $M$, that is, square arrays of 0s and 1s such that a value of 1 in row $i$ and column $j$ indicates an edge $i\to j$ in the digraph. The length-2 path uniqueness property in the digraph $D$ for a central groupoid has a neat translation into matrix language: it's equivalent to $M^2 = J$, where $J$ is the matrix of all 1s. So central groupoids also correspond to zero-one matrices whose square is the all-ones matrix.
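Here is a small check of the matrix condition in plain Python (no external libraries). It builds the matrix for the natural central groupoid of order $n^2$, with the pair $(a,b)$ stored at index $a\cdot n + b$; that indexing and the helper names are my own choices for the sketch.

```python
def natural_adjacency(n):
    """Zero-one matrix of the natural central groupoid on pairs over range(n).

    The vertex (a, b) gets index a * n + b, and (a, b) -> (b, c) is an edge.
    """
    size = n * n
    M = [[0] * size for _ in range(size)]
    for a in range(n):
        for b in range(n):
            for c in range(n):
                M[a * n + b][b * n + c] = 1
    return M

def mat_mul(X, Y):
    """Plain square matrix product."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

M = natural_adjacency(3)
M_squared = mat_mul(M, M)
square_is_all_ones = all(entry == 1 for row in M_squared for entry in row)
print(square_is_all_ones)
```

Each entry of $M^2$ counts the paths of length 2 between a pair of vertices, so an all-ones square says exactly that every such path exists and is unique.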

When working with concrete examples by hand, I find a different representation of central groupoids more convenient, since drawing out digraphs gets messy very quickly. A central groupoid $C$ can also be represented by a function $f: C\to \mathcal P C$ (that is, an assignment of a subset of $C$ to each element of $C$) with the special property that for each subset $f(x)\subset C$, the collection $\{f(y) ~ : ~ y\in f(x)\}$ forms a partition of $C$. In terms of the central groupoid operation, the set $f(x)$ would be the set of all left-images of $x$, that is, the set of values $x\cdot y$ for $y$ ranging over $C$. In terms of the digraph representation $D$ for the central groupoid $C$, for a vertex $v\in V(D)$, the value $f(v)$ would give the set of all vertices that are the target of an edge originating from $v$.

There's only one central groupoid of order 4, namely the natural one, so let's look at a couple concrete examples of order 9. Here's how I visualize the natural central groupoid of order 9. On the right is a table describing the function $f$, which maps each $x\in C$ to a subset of $C$. On the left is how I picture the digraph representation of $C$ without drawing a tangled mess of edges: for each vertex, the color of the vertex indicates that it should have directed edges pointing to each vertex in the set of the same color. For instance, $v_1$ has outward-pointing edges $v_1\to v_1$, $v_1\to v_2$ and $v_1\to v_3$ because it is red, and the red-colored group contains vertices $v_1,v_2,v_3$.

Values of the central groupoid operation can be read off from these diagrams. To compute, for instance, $v_1\cdot v_5$, you can follow these steps:

identify the vertices that $v_1$ points to (they are $v_1,v_2,v_3$)
identify which of those vertices points to $v_5$ ($v_2$ does, since it is orange and $v_5$ is in the orange group)
this vertex $v_2$ is the value of $v_1\cdot v_5$
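The steps above can be sketched in Python. The labelling of the natural central groupoid below (vertex $v_i$ stored as the integer $i$, with $v_1$ pointing to $v_1,v_2,v_3$) is an assumption chosen to agree with the description of the diagram, and the function names are hypothetical.

```python
def product_from_f(f, v, w):
    """Compute v . w: the unique x in f(v) such that w is in f(x)."""
    candidates = [x for x in f[v] if w in f[x]]
    if len(candidates) != 1:
        raise ValueError(f"expected one candidate, found {candidates}")
    return candidates[0]

def is_valid_f(f):
    """Check that {f(y) : y in f(x)} is a partition of the whole set, for every x."""
    universe = set(f)
    for x in f:
        blocks = [f[y] for y in f[x]]
        covers = set().union(*blocks) == universe
        disjoint = sum(len(b) for b in blocks) == len(universe)
        if not (covers and disjoint):
            return False
    return True

# Natural central groupoid of order 9: label i = 3a + b + 1 stands for the
# pair (a, b), which points to every pair of the form (b, c).
f = {3 * a + b + 1: {3 * b + c + 1 for c in range(3)}
     for a in range(3) for b in range(3)}

valid = is_valid_f(f)
v1_times_v5 = product_from_f(f, 1, 5)
print(valid, v1_times_v5)
```

With this labelling the computed value of $v_1\cdot v_5$ is $v_2$, matching the walk-through.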

Here's an example of a central groupoid of order 9 that is not natural:

You can check for yourself that it satisfies the required property, namely that for each set $f(w) = \{x,y,z\}$ appearing in the table, the collection of sets $\{f(x),f(y),f(z)\}$ forms a partition of $\{1,2,\ldots,9\}$. Note that there are only two distinct partitions to choose from here.

Cardinality and idempotent elements of a finite central groupoid

We shall see that each element $\alpha$ of a finite central groupoid $C$ determines a "local coordinate system" for $C$, and that consequently $C$ must have a perfect square number of elements.

Let $\alpha\in C$ be an arbitrary element of a central groupoid. Let the set of left-images of $\alpha$ be denoted $\alpha\cdot C$, and let the set of right-images of $\alpha$ be denoted $C\cdot \alpha$. In a natural central groupoid of order $n^2$, the set $\alpha\cdot C$ would consist of all points of the form $(\alpha_2, -)$ and $C\cdot \alpha$ would consist of all points of the form $(-,\alpha_1)$, so that each would have precisely $n$ points. Although a non-natural central groupoid does not come equipped with such a global coordinate representation, it remains the case in these central groupoids that $\alpha\cdot C$ and $C\cdot \alpha$ have the same number of elements. In particular, we have the following bijection $\alpha\cdot C \simeq C\cdot \alpha$: $\begin{align*}g_\alpha ~ &: ~ x \mapsto x\cdot\alpha \\ g_\alpha^{-1} ~ &: ~ x \mapsto \alpha\cdot x \end{align*}$ To see that this is a bijection, let $x=\alpha\cdot x'\in \alpha\cdot C$ be arbitrary and observe that

$\begin{align*}g_\alpha^{-1}(g_\alpha(x)) &= \alpha\cdot (x\cdot \alpha) \\ &= \alpha\cdot ((\alpha\cdot x')\cdot \alpha) \\ &= ((x'\cdot\alpha)\cdot (\alpha\cdot x'))\cdot ((\alpha\cdot x')\cdot \alpha) \\ &= \alpha\cdot x' \\ &= x \end{align*}$

By the mirror-image computation, $g_\alpha(g_\alpha^{-1}(y)) = y$ for every $y\in C\cdot\alpha$ as well, so $g_\alpha$ and $g^{-1}_\alpha$ as defined are inverses.

Knowing that $|\alpha\cdot C| = |C\cdot \alpha|$, we are ready to prove that $C$ must have cardinality a perfect square, and in the process derive a "local coordinate system" for $C$ based on the element $\alpha$. To accomplish this, we can exhibit a very simple bijection between $C$ itself and the set $(\alpha\cdot C)\times (C\cdot\alpha)$, namely the mapping $x\mapsto (\alpha\cdot x, x\cdot\alpha)$ with inverse mapping $(y_1,y_2)\mapsto y_1\cdot y_2$. The fact that these mappings are inverses follows directly from the central groupoid law and the finiteness of the set $C$. This establishes immediately that $|C| = n^2$, where $n = |\alpha\cdot C| = |C\cdot\alpha|$.
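To make the local coordinate system concrete, here is a sketch that computes it for one element of the natural central groupoid of order 9. The function `local_coordinates` is a hypothetical helper; it accepts any operation given as a Python function, but only the natural example is tried here.

```python
from itertools import product

def natural_op(p, q):
    """Natural operation on ordered pairs: (x1, x2) . (y1, y2) = (x2, y1)."""
    return (p[1], q[0])

def local_coordinates(elements, op, alpha):
    """Return alpha.C, C.alpha and the map x -> (alpha . x, x . alpha)."""
    left = {op(alpha, x) for x in elements}    # the set alpha . C
    right = {op(x, alpha) for x in elements}   # the set C . alpha
    coords = {x: (op(alpha, x), op(x, alpha)) for x in elements}
    return left, right, coords

pairs = list(product(range(3), repeat=2))
left, right, coords = local_coordinates(pairs, natural_op, alpha=(0, 2))

# The coordinate map should hit every pair in (alpha.C) x (C.alpha) exactly once,
# and multiplying the two coordinates should recover the original element.
is_bijection = (set(coords.values()) == set(product(left, right))
                and len(set(coords.values())) == len(pairs))
recovers = all(natural_op(y1, y2) == x for x, (y1, y2) in coords.items())
print(len(left), len(right), is_bijection, recovers)
```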

We can also prove that a central groupoid $C$ of order $n^2$ has precisely $n$ idempotent elements, that is, elements $x$ satisfying $x\cdot x = x$. This property is easiest to prove using the matrix representation $M$ for the central groupoid. The rows and columns of $M$ correspond to elements of $C$ or nodes of the corresponding digraph $D$, and idempotent elements correspond to loops of the digraph, or nonzero diagonal entries of $M$. Hence the number of idempotents is equal to the trace $\text{tr}(M)$. This trace can be computed directly as the sum of the eigenvalues of $M$. Since $M^2 = J$ and $J$ has characteristic polynomial $(\lambda - n^2)\lambda^{n^2-1}$, the eigenvalues of $M$ are $0$ with multiplicity $n^2-1$ together with a single eigenvalue equal to $n$ or $-n$. The trace of a zero-one matrix cannot be negative, so that eigenvalue is $n$, meaning that $M$ has characteristic polynomial $(\lambda - n)\lambda^{n^2-1}$ and trace $\text{tr}(M) = n$.

The more than 3000 central groupoids of order 16

There are precisely 6 central groupoids of order 9 up to isomorphism. It isn't difficult to enumerate them by hand with a bit of patience. But if we move on to the central groupoids of order 16, the number of possibilities explodes. (As far as I'm aware, nobody in the mathematical literature has successfully enumerated the 16-element central groupoids yet, even with computer assistance.)

The problem of determining whether two binary operations on finite sets of the same cardinality are isomorphic is in general very computationally intensive. For two sets of order $N$, there are $N!$ different bijections between the two sets, each of which is a possible isomorphism to be confirmed or ruled out. However, depending on the nature of the binary operation and its properties, the problem can sometimes be simplified considerably.

In the case of central groupoids, the idempotent elements can be used to simplify the process of checking isomorphism. We know that any central groupoid with $n^2$ elements must have precisely $n$ idempotent elements. As a consequence of this fact, the idempotent elements of any central groupoid must generate the entire central groupoid, since the sub-central-groupoid generated by those elements must also be a central groupoid with at least $n$ idempotent elements (namely the generators themselves), and therefore with at least $n^2$ elements. Hence, any isomorphism between two central groupoids is completely determined by its action on the idempotent elements. That narrows down the number of bijections that need to be checked from $N!$ to $(\sqrt N)!$. The latter still grows quite fast as a function of $N$, but for smaller values of $N$, this trick can be a huge help: for example, in the case of $N=16$, it's the difference between checking $16!\approx 2.1\times 10^{13}$ bijections and only $4!=24$. Note that this also simplifies the problem of finding the symmetry group of central groupoids, and it implies that the symmetry group of a central groupoid of order $n^2$ is isomorphic to a subgroup of $S_n$.
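The claim that the idempotents generate everything is easy to test on an example. The sketch below is in Python rather than the Haskell of my actual enumeration code, and the helper `generated_by` is hypothetical; it closes a set of generators under the operation and is tried on the natural central groupoid of order 16.

```python
import math
from itertools import product

def natural_op(p, q):
    """Natural operation on ordered pairs: (x1, x2) . (y1, y2) = (x2, y1)."""
    return (p[1], q[0])

def generated_by(generators, op):
    """Smallest set containing the generators and closed under op."""
    closure = set(generators)
    frontier = set(generators)
    while frontier:
        new = set()
        for x in frontier:
            for y in closure:
                new.add(op(x, y))
                new.add(op(y, x))
        frontier = new - closure
        closure |= frontier
    return closure

pairs = list(product(range(4), repeat=2))
idempotents = [x for x in pairs if natural_op(x, x) == x]
generated = generated_by(idempotents, natural_op)

print(len(idempotents), len(generated))
# Candidate bijections to test: all of them versus those fixed by the idempotents.
print(math.factorial(16), math.factorial(4))
```

An isomorphism test along these lines would try each of the $4!$ assignments of idempotents to idempotents and extend it along products, rather than searching all $16!$ bijections.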

I've identified 3,471 nonisomorphic central groupoids of order $4^2=16$ using an algorithm that produces novel central groupoids by making small tweaks to existing ones. Unfortunately, I'm not sure whether my method produces all possible central groupoids of order $16$, but (purely conjecturally) I believe that it does. Furthermore, my code, written in Haskell, is not formally verified. (You can check it out in this Gist if you want, or you can explore this JSON data listing the central groupoids I found and some of their properties.) So any of my comments below about the results of this number crunching can be taken with a grain of salt.

One interesting property of central groupoids to consider is the symmetry group, that is, its group of automorphisms (bijective operation-respecting functions). As mentioned earlier, any isomorphism of central groupoids of order $n^2$ is determined by its action on the $n$ idempotent elements (since they generate the whole set), so it suffices to characterise how these automorphisms permute these elements. For central groupoids of order $16$, it is pretty easy to calculate automorphism groups. For the 3,471 nonisomorphic central groupoids that I found, here's the breakdown of symmetry groups. As you can see, the overwhelming majority have no nontrivial symmetries at all, but a few exceptional ones have more interesting symmetry groups.

Automorphism group action	Number of central groupoids of order 16
Trivial	3254
$\mathbb Z_2$ swapping one pair of idempotents	136
$\mathbb Z_2$ swapping two pairs of idempotents	56
$\mathbb Z_3$	8
$S_3$	4
$V$ acting intransitively	5
$V$ acting transitively	2
$\mathbb Z_4$	4
$D_4$	1
$S_4$	1 (just the natural one)

Another interesting statistic: of the more than 3,000 central groupoids that I found, only 33 are self-dual, meaning that they are isomorphic to the central groupoid you get by reversing the order of the operation (or reversing the edges in the corresponding digraph, if you prefer). This makes self-duality a shockingly rare property.
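For a concrete instance of self-duality, the natural central groupoid is isomorphic to its own dual via the map that swaps the two coordinates. The Python sketch below checks this by brute force; `dual` and `is_isomorphism` are hypothetical helper names.

```python
from itertools import product

def natural_op(p, q):
    """Natural operation on ordered pairs: (x1, x2) . (y1, y2) = (x2, y1)."""
    return (p[1], q[0])

def dual(op):
    """The dual operation, obtained by reversing the order of the arguments."""
    return lambda x, y: op(y, x)

def is_isomorphism(phi, elements, op_a, op_b):
    """Check that phi permutes `elements` and phi(x op_a y) == phi(x) op_b phi(y)."""
    images = {phi(x) for x in elements}
    return (images == set(elements)
            and all(phi(op_a(x, y)) == op_b(phi(x), phi(y))
                    for x, y in product(elements, repeat=2)))

def swap(p):
    """Swap the two coordinates of a pair."""
    return (p[1], p[0])

pairs = list(product(range(4), repeat=2))
natural_is_self_dual = is_isomorphism(swap, pairs, natural_op, dual(natural_op))
print(natural_is_self_dual)
```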

A few more intriguing finds among the $16$-element central groupoids that I enumerated include the following:

There are 7 central groupoids for which every non-idempotent generates a sub-central-groupoid of order 4.
There is precisely 1 central groupoid for which every non-idempotent generates the entire set.
There are an additional 25 central groupoids for which all but 2 of the non-idempotent elements generate the entire set.
There are 1043 central groupoids in which no two elements give rise to the same system of "local coordinates".
There are 52 central groupoids (including the natural one) in which every element can be written as a product of 2 idempotents.
There are 30 central groupoids in which there are 13 distinct left-image-sets $\alpha\cdot C$, and none with more than 13.

The posts on this website are licensed under CC-by-NC 4.0.

Polynomial-annihilating functions and sequences

2025-12-03

In this post I just want to share a result from real analysis that I find very unintuitive. It states that there are nonzero functions $f:\mathbb R^+\to\mathbb R$ that annihilate all polynomial functions, that is, functions that are orthogonal to every polynomial function with respect to the $L^2$ inner product. In other words, there are functions $f:\mathbb R^+\to\mathbb R$ for which $\int_0^\infty f(x) \cdot P(x) ~ dx = 0$ for every polynomial $P$. Put differently, the following holds for all $k\in\mathbb N$: $\int_0^\infty f(x)\cdot x^k ~ dx = 0$ Surprisingly, it is fairly easy to write down the formula of a specific elementary function with this strange property. The following integral formula holds for all $\alpha\in\mathbb C$ with $\text{Re}(\alpha) > 0$: $g_n(\alpha) = \int_0^\infty x^n e^{-\alpha x} ~ dx = \frac{n!}{\alpha^{n+1}}$ If we let $\alpha$ take complex roots of unity as values, it turns out that particular linear combinations of different values of $g_n(\alpha)$ vanish for all $n$ belonging to a certain subsequence of $\mathbb N$. For example, every fourth value of $\zeta_8^n + \zeta_8^{-n}$ is zero, because $\zeta_8^{4n+2}=(-1)^n i$ and $\zeta_8^{-(4n+2)}= -(-1)^n i$. Hence $\frac{g_{4n+1}(\zeta_8) + g_{4n+1}(\zeta_8^{-1})}{2} = \int_0^\infty x^{4n+1} e^{-x/\sqrt{2}}\cos\big(x/\sqrt{2}\big) ~ dx = 0$ After an additional substitution in this integral, we find that for all $k\in\mathbb N$, $\int_0^\infty \frac{e^{-\sqrt[4]{x}}\cos(\sqrt[4]{x})}{\sqrt x}\cdot x^k ~ dx = 0$ The same method yields further example functions:

$\begin{align*} f(x) &= \frac{e^{-\sqrt{3} \sqrt[6]{x}}\cos(\sqrt[6]{x})}{\sqrt{x}} \\ f(x) &= \frac{e^{-(\sqrt{2}-1) \sqrt[8]{x}}\cos(\sqrt[8]{x})}{\sqrt{x}} \\ f(x) &= \frac{e^{-(2-\sqrt{3}) \sqrt[12]{x}}\cos(\sqrt[12]{x})}{\sqrt{x}} \\ \end{align*}$
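These examples can be sanity-checked without any numerical integration. After substituting $y = \sqrt[r]{x}$, each moment becomes a multiple of $\int_0^\infty y^m e^{-ay}\cos(y) ~ dy$, which is the real part of $m!/\alpha^{m+1}$ with $\alpha = a + i$. The Python sketch below relies on that closed form; the function names and the table of `(alpha, step, offset)` cases are my own bookkeeping, worked out from the substitutions.

```python
import cmath
import math

def laplace_moment(alpha, m):
    """Closed form of the integral of x^m e^(-alpha x) over [0, inf), Re(alpha) > 0."""
    return math.factorial(m) / alpha ** (m + 1)

def relative_real_part(alpha, m):
    """Real part of the moment divided by its absolute value."""
    value = laplace_moment(alpha, m)
    return value.real / abs(value)

# (alpha, step, offset): the moments m = step * k + offset should be purely
# imaginary, so the corresponding cosine integrals vanish.
cases = [
    (cmath.exp(1j * math.pi / 4), 4, 1),      # fourth-root example (zeta_8)
    (complex(math.sqrt(3), 1), 6, 2),         # sixth-root example
    (complex(math.sqrt(2) - 1, 1), 8, 3),     # eighth-root example
    (complex(2 - math.sqrt(3), 1), 12, 5),    # twelfth-root example
]

worst = max(abs(relative_real_part(alpha, step * k + offset))
            for alpha, step, offset in cases
            for k in range(4))
print(worst < 1e-9)
```

The check is only up to floating-point rounding, which is why it compares the real part against the size of the whole moment instead of testing for exact zero.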

Naturally, linear combinations of these functions have the same strange property; that is, the set of all integrable functions $f:\mathbb R^+\to\mathbb R$ that are orthogonal to every polynomial is a vector space. It is precisely the null space of the operator $f ~ \mapsto ~ \bigg\langle\int_{0}^\infty f(x) \cdot x^k ~ dx\bigg\rangle_{k\in\mathbb N}$ The existence of such example functions is all the more surprising considering that no such nonzero functions exist in the vector space of integrable functions on the interval $[0,1]$ instead of $[0,\infty)$, a nonexistence that is a consequence of the Stone-Weierstrass theorem. If a nonzero function $f:[0,1]\to\mathbb R$ were orthogonal to every polynomial on the interval $[0,1]$, then we would have $\langle f, P\rangle =0$ for every polynomial $P$, which would make it impossible to approximate $f$ by polynomial functions, contradicting the Stone-Weierstrass theorem.

This result also has a discrete counterpart. There also exist infinite sequences $(a_n)$ of real numbers that annihilate all polynomial sequences $(n^k)$ with respect to the $\ell^2$ inner product, so that the following holds for every $k\in\mathbb N$:

$\sum_{n=1}^{\infty} a_n\cdot n^k = 0$

These sequences also form a vector space, namely the kernel of an infinite Vandermonde matrix. This is even more astonishing because all finite Vandermonde submatrices of this matrix have full rank, and therefore have a zero-dimensional kernel:

$\begin{bmatrix}1 & 1 & 1 & 1 & \cdots \\ 1 & 2 & 3 & 4 & \cdots \\ 1 & 2^2 & 3^2 & 4^2 & \cdots \\ 1 & 2^3 & 3^3 & 4^3 & \cdots \\ \cdots & \cdots & \cdots & \cdots & \end{bmatrix}\begin{bmatrix}a_1 \\ a_2 \\ a_3 \\ a_4 \\ \cdots\end{bmatrix} = \mathbf{0}$

It is harder to prove the existence of such sequences, but we can make use of the polynomial-annihilating functions $f:\mathbb R^+\to\mathbb R$ mentioned earlier to produce example sequences. Let $f$ be a particular polynomial-annihilating function $f:\mathbb R^+\to\mathbb R$, and define a complex function $\phi_f$ by the following formula, where $z\in \mathbb C$ satisfies the inequality $\text{Re}(z) < 1$:

$\phi_f(z) := \int_0^\infty f(x) \cdot e^{-x(1-z)} ~ dx$

This formula defines an analytic function $\phi_f$ on the whole disc $|z| < 1$, from which it follows that $\phi_f$ has a convergent power series inside this disc:

$\phi_f(z) = \sum_{n=0}^\infty \frac{\phi_f^{(n)}(0)}{n!}\cdot z^n$

Using the formula by which $\phi_f$ was defined, one can prove that $\phi_f$ and all of its derivatives $\phi_f^{(k)}$ approach zero as $z\to 1$:

$\lim_{z\to 1}\phi_f^{(k)}(z) = \int_0^\infty f(x)\cdot x^k ~ dx = 0$

This is in itself rather unusual limiting behavior for an analytic function, but beyond that, we can conclude from it that the coefficient sequence $a_n = \phi_f^{(n)}(0)/n!$ is orthogonal to every polynomial, because

$\lim_{z\to 1}\phi_f^{(k)}(z) = \sum_{n=0}^\infty a_n\cdot n(n-1)\cdots (n-k+1) = 0$

and this family of polynomials $n(n-1)\cdots (n-k+1)$ spans the whole vector space of polynomial functions. This yields the following formula for an example sequence $(a_n)$ with the property we are interested in:

$a_n := \frac{1}{n!}\int_0^\infty f(x) \cdot x^n e^{-x} ~ dx$

As far as I know, there is no nice closed-form formula for a sequence $(a_n)$ that is orthogonal to every polynomial. The Thue-Morse sequence (written here as $\sigma(n)=\pm 1$) does earn an "honorable mention", though: although the infinite sum of $\sigma(n)\cdot n^k$ does not converge at all, its sequence of partial sums hits zero infinitely often. In particular, $\sum_{n=1}^{m\cdot 2^{k+1}} \sigma(n)\cdot n^k = 0$ for all $m,k\in\mathbb N$, which is in itself a very peculiar property!
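The Thue-Morse identity is easy to test by brute force. The Python sketch below assumes a particular indexing, which the text does not spell out: the sequence starts at $n=1$, and $\sigma(n)$ is $+1$ or $-1$ according to whether $n-1$ has an even or odd number of 1s in binary. The function names are hypothetical.

```python
def thue_morse_sign(j):
    """+1 if j >= 0 has an even number of 1 bits in binary, otherwise -1."""
    return -1 if bin(j).count("1") % 2 else 1

def partial_sum(k, m):
    """Sum of sigma(n) * n^k for n = 1 .. m * 2^(k+1), with sigma(n) = sign(n - 1)."""
    upper = m * 2 ** (k + 1)
    return sum(thue_morse_sign(n - 1) * n ** k for n in range(1, upper + 1))

all_zero = all(partial_sum(k, m) == 0 for k in range(5) for m in range(1, 5))
print(all_zero)
```

Since everything here is integer arithmetic, the partial sums are exactly zero, not just zero up to rounding. Stopping the sum at other lengths generally gives a nonzero value, e.g. `partial_sum` would not vanish if the upper limit were $2^k$ instead of a multiple of $2^{k+1}$.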

A vector technique for unsupervised lexeme discovery

2025-11-05

Lately I've been an active contributor to Tatoeba, a huge open-source collection of parallel sentences in many different world languages. Aside from being an amazing resource for the languages I'm studying, it's given me exposure to languages that I didn't even know existed before, and it's also a great dataset for NLP projects.

I've been mulling over the following question: given a bunch of example sentences translated into your own language, would it be possible to algorithmically deduce translations for individual lexemes/morphemes? When done manually, this task is fairly simple. For instance, I don't know any Hungarian, but if I were given the following Hungarian translations of English sentences:

English	Hungarian
Tom filled the bottle with drinking water.	Tom megtöltötte az üveget ivóvízzel.
Tom drinks at least three liters of water every day.	Tom naponta legalább három liter vizet iszik.
If it weren't for water, humans wouldn't survive.	Ha nem lenne víz, az emberek nem élnék túl.
The water came up to our knees.	A víz térdig ért.
I would like some water.	Kérek egy kis vizet.

...then after staring at these sentences for a while, I would be able to guess that the word for water in Hungarian is víz without any prior knowledge. This is because the only thing in common between the English sentences is the word water, whereas for the Hungarian sentences it seems to be víz or viz (sometimes with additional prefixes/suffixes). In fact, I would also be willing to guess that ivóvízzel means drinking water.

Really, all I did was look at these sentences, find some common subsequences of characters, and make a heuristic judgment about what the most likely translation of the word water would be. This process seems like it should be susceptible to automation, so I gave it a try!

Most of the NLP tools and algorithms that I've learned about are for processing text at the word/morpheme level, and presuppose a tokenizer/lemmatizer/stemmer for the target language. This task, however, occurs at the character level and concerns how we discover lexemes/morphemes in the first place. I'm not familiar with many NLP techniques that work at this low of a level, so this problem has been a very fun challenge!

Below, I explain my approach and discuss some of its current weaknesses.

But first, a little eye candy!

Given a list of sentences in a target language with a word/phrase/substring in common between their English translations, my code calculates a kind of "heatmap" on each sentence, assigning each position in the sentence a score between 0 and 1 quantifying its local similarity to other sentences. We can visualize the results for specific sentences by graphing the scores by character index. Here's an example in Hungarian, generated when searching for a translation of the word water:

I also have a utility for visualizing this "relevance score" by highlighting segments of sentences in the target language with varying levels of saturation. Here's what this looks like for the same example in Hungarian:

Candidate words can be obtained from sentences by extracting segments containing the highest scores. By defining a custom string distance metric on the extracted strings and performing hierarchical clustering, we can also obtain a heuristic grouping of the words into clusters comprising possible lexemes. These clustered words or word forms can then be visualized as a dendrogram. Here's a dendrogram output by my code for the same example in Hungarian:

Although most of my test runs have used parallel sentences from Tatoeba, data can be ingested from any TSV-formatted file of parallel sentences. I was also able to import some data in Kannada (a Dravidian language spoken in India) from the Anuvaad Parallel Corpus and test out my algorithm on it. Here's the resulting dendrogram when I asked it to infer possible translations for the word beautiful:

Jump to the end of the post for a huge table of examples showing lexeme guesses for a few very common words in several different languages. Though my code is still pretty rough around the edges, I'm very happy with how the results are coming out so far, and I wonder if it has the potential to be developed into something more sophisticated, like an unsupervised best-effort lemmatizer for languages lacking established lemmatization tools.

If you're interested, you can check out my code in a Jupyter notebook here on Github. I encourage you to play around with it! Parallel sentence data from several sample languages is included in the repo, so no additional downloads (aside from Python packages) should be necessary.

Now, here's a more in-the-weeds description of my approach.

The problem under consideration is as follows: given a bunch of sentences in language $L$ whose translations contain a certain word $w$ (or more generally, matching a certain regex), produce one or more "candidate morphemes" in the language $L$ that might serve as translations of $w$. I'm calling this problem "unsupervised" because I'm not using ground truth data (such as dictionaries in various languages) to train any sort of model to recognize words.

My first thought was to use an n-gram model to analyze common sequences of characters in example sentences. Given a set $S$ of sentences in language $L$ with translations containing the target word $w$, we could tabulate the frequencies of 2-grams or 3-grams among those sentences. Then the "hottest" substrings in each sentence could be identified as the ones containing more high-frequency n-grams on average.
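A minimal sketch of this n-gram idea in Python might look as follows. It is not the code from my notebook: the function names are hypothetical, and scoring each character by the average frequency of the n-grams that cover it is just one simple way of reading "more high-frequency n-grams on average".

```python
from collections import Counter

def ngram_counts(sentences, n=2):
    """Tabulate how often each character n-gram occurs across all sentences."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

def position_scores(sentence, counts, n=2):
    """Score each character by the mean frequency of the n-grams covering it."""
    scores = []
    for i in range(len(sentence)):
        first = max(0, i - n + 1)
        last = min(i, len(sentence) - n)
        covering = [counts[sentence[j:j + n]] for j in range(first, last + 1)]
        scores.append(sum(covering) / len(covering) if covering else 0.0)
    return scores

# The Hungarian "water" sentences from the table above, lowercased.
sentences = [s.lower() for s in [
    "Tom megtöltötte az üveget ivóvízzel.",
    "Tom naponta legalább három liter vizet iszik.",
    "Ha nem lenne víz, az emberek nem élnék túl.",
    "A víz térdig ért.",
    "Kérek egy kis vizet.",
]]
counts = ngram_counts(sentences, n=2)
scores = position_scores(sentences[3], counts, n=2)
for ch, sc in zip(sentences[3], scores):
    print(repr(ch), round(sc, 2))
```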

I implemented this approach and it worked shockingly well for many languages. However, there was a huge drawback for languages like Arabic and Hebrew that have non-contiguous word roots. For instance, in Hebrew, the word for to read has the 3-letter root קרא, and conjugating this verb sometimes involves inserting letters in between: he reads becomes קורא, inserting the letter ו. This is a big problem for the n-grams approach, because when scoring these words in a collection of sentences, the 2-grams קר and קו would compete with each other in the 2-gram frequency count, causing different conjugations of to read to detract from each other's scores.

My immediate next thought was to use a generalized version of n-grams called "skip-grams", in which both contiguous and non-contiguous letter combinations are tabulated, e.g. there might be a frequency category not only for the substring קר, but also an additional category for occurrences of ק and ר separated by one or fewer characters, or by two or fewer characters, and so on. The problem with this approach is that the number of possible categories grows very quickly and it's not obvious what kind of scoring system should be used to take them all into account.

The idea I'm about to describe occurred to me at around midnight one night, and I ended up staying up until about 3am frantically coding up a proof-of-concept - I had to know if it would work! Vector embeddings were fresh in my mind because of a recent online course in NLP, but I had never seen a vector embedding technique applied to individual characters rather than words.

Say $C$ is the set of all characters in the language $L$. These characters might be normalized to avoid distinguishing characters that are "really the same", e.g. capitalized versus lowercase versions of the same letter, or accented versus non-accented variants, etc. Each character $c\in C$ is assigned a unit vector $\phi(c)\in \mathbb R^d$ where $d$ is the dimension of the embedding. These embeddings should either assign orthogonal vectors to different characters (in which case we must have $d\ge |C|$) or very nearly orthogonal vectors to different characters (in which case you can often do with fewer than $|C|$ dimensions). This ensures that different characters are handled independently.

Once we have a character embedding, we define a way of embedding pairs of characters separated by a certain number of indices in a string. This can be defined by a function $\psi:C^2\times \{0,\ldots,\ell\}\to\mathbb R^d$, where $\psi(c_1,c_2,j)$ is the embedding for $c_1$ followed by $c_2$ after $j$ characters, and $\ell$ is the "lookahead value" determining the maximum level of separation represented by the embedding. I've experimented with a few different options for this embedding, but the general idea is that $\psi(c_1,c_2, i)$ and $\psi(c_1,c_2, j)$ should be somewhat similar to each other, especially when $i,j$ are close, in order to allow embeddings of the same character $c_2$ in slightly different positions after $c_1$ to "constructively interfere" with each other. In this way, $\psi$ acts sort of like a "fuzzy" n-grams frequency table that avoids huge proliferation of frequency categories by allowing some of them to conflate with each other. Further, $\psi(c_1,c_2,i)$ and $\psi(c_1,c_3,j)$ should be orthogonal or near-orthogonal when $c_2\ne c_3$.

One option I've tried has been $\psi(c_1,c_2,j) := \cos\Big(\frac{\pi j}{2\ell+1}\Big)\phi(c_2)$ and another is the following, where $U$ is a unitary matrix that is close to the identity and $\alpha < 1$ is some constant: $\psi(c_1,c_2,j) := (\alpha U)^j \phi(c_2)$ Both of these work pretty well, but I suspect that the results can be improved by more intelligently designing the function $\psi$, and this is a detail I want to continue experimenting with.
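To illustrate the first (cosine-weighted) option, here is a small Python sketch. It assumes one-hot vectors for $\phi$, which is only one of the possible character embeddings, and the function names are made up for this example.

```python
import math

def one_hot(index, d):
    """phi(c): a unit vector with a single 1 at the character's index."""
    return [1.0 if i == index else 0.0 for i in range(d)]

def psi_cosine(phi_c2, j, lookahead):
    """psi(c1, c2, j) = cos(pi * j / (2 * lookahead + 1)) * phi(c2).

    As in the formula, the value does not depend on c1 directly.
    """
    weight = math.cos(math.pi * j / (2 * lookahead + 1))
    return [weight * x for x in phi_c2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

lookahead = 3
phi_a, phi_b = one_hot(0, 4), one_hot(1, 4)

# Weights for separations j = 0 .. lookahead: positive and slowly decreasing.
weights = [dot(psi_cosine(phi_a, j, lookahead), phi_a) for j in range(lookahead + 1)]
# Same following character at nearby separations: vectors reinforce each other.
same_char = dot(psi_cosine(phi_a, 1, lookahead), psi_cosine(phi_a, 2, lookahead))
# Different following characters: orthogonal.
diff_char = dot(psi_cosine(phi_a, 1, lookahead), psi_cosine(phi_b, 1, lookahead))
print([round(w, 3) for w in weights], same_char > 0, diff_char == 0.0)
```

Because the largest angle used is $\pi\ell/(2\ell+1) < \pi/2$, every weight stays strictly positive, so nearby separations never cancel each other out.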

Next, we define a function $\Psi$ such that $\Psi(i, s)$ gives an embedding combining the character pair embeddings $\psi(s[i], -, -)$ for several of the characters following $s[i]$, up to the character $s[i+\ell]$ at the lookahead threshold: $\Psi(i, s) := \sum_{j=i}^{i+\ell} \psi(s[i], s[j], j-i)$ And then, given a whole collection of sentences $S$, we define a combined embedding $\overline{\Psi}(c, S)$ that, intuitively speaking, summarizes the "average context" of the character $c$ in all of the places it appears in all the sentences of $S$. Writing $\#(c, s)$ for the number of occurrences of $c$ in the sentence $s$, it is defined as follows: $\overline{\Psi}(c, S) := \frac{1}{1+\tfrac{|S|}{|\{s\in S: ~ c \in s\}|}}\cdot\sum_{s\in S}\sum_{s[j] = c} \frac{\Psi(j, s)}{\#(c, s)}$

This is an average of all of the embeddings $\Psi(j, s)$ of the positions where the character $c$ appears across all sentences, averaged across different appearances of the character $c$ in each sentence $s$. Making this an average rather than a sum is vital, both because it prevents extremely long sentences from affecting these embeddings disproportionately, and because it prevents high-frequency characters from having much larger embeddings in general. It is also multiplied by a scaling factor punishing characters that occur only in a small number of the sentences in $S$.

Finally, for each sentence $s$ in $S$, each of its characters is scored by calculating the cosine similarity of each character's local embedding in that specific sentence with its global embedding across all of the sentences in $S$. That is, writing $c = s[i]$ for the character at position $i$:

$\text{score}(i, s) = \frac{\overline{\Psi}(c, S)\cdot \Psi(i, s)}{\lVert\overline{\Psi}(c, S)\rVert\cdot \lVert\Psi(i, s)\rVert}$

When a character is followed by sequences of characters that frequently follow it in many of the sentences in $S$, then the vectors $\overline{\Psi}(c, S)$ and $\Psi(i, s)$ should point in similar directions, meaning that $\text{score}(i, s)$ should be larger.
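Putting the pieces together, the scoring step can be sketched in plain Python as below. This is a simplified stand-in for my notebook code, not a copy of it: it assumes one-hot character vectors and the cosine-weighted $\psi$, truncates the lookahead window at the end of the sentence, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def local_embedding(s, i, alphabet, lookahead):
    """Psi(i, s): cosine-weighted sum of one-hot vectors for s[i .. i+lookahead]."""
    vec = [0.0] * len(alphabet)
    last = min(i + lookahead, len(s) - 1)   # assumption: truncate at sentence end
    for j in range(i, last + 1):
        vec[alphabet[s[j]]] += math.cos(math.pi * (j - i) / (2 * lookahead + 1))
    return vec

def global_embeddings(sentences, alphabet, lookahead):
    """bar-Psi(c, S) for every character c of the alphabet."""
    d = len(alphabet)
    totals = {c: [0.0] * d for c in alphabet}
    containing = defaultdict(int)           # number of sentences containing c
    for s in sentences:
        positions = defaultdict(list)
        for i, c in enumerate(s):
            positions[c].append(i)
        for c, idxs in positions.items():
            containing[c] += 1
            for i in idxs:
                emb = local_embedding(s, i, alphabet, lookahead)
                for k in range(d):
                    totals[c][k] += emb[k] / len(idxs)   # divide by #(c, s)
    # Scaling factor from the formula; note it cancels in the cosine similarity.
    return {c: [x / (1.0 + len(sentences) / containing[c]) for x in totals[c]]
            for c in alphabet}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_sentence(s, global_emb, alphabet, lookahead):
    """score(i, s) for every position i of the sentence s."""
    return [cosine(global_emb[s[i]], local_embedding(s, i, alphabet, lookahead))
            for i in range(len(s))]

sentences = ["a víz térdig ért.", "kérek egy kis vizet.",
             "ha nem lenne víz, az emberek nem élnék túl."]
alphabet = {c: k for k, c in enumerate(sorted(set("".join(sentences))))}
global_emb = global_embeddings(sentences, alphabet, lookahead=3)
scores = score_sentence(sentences[0], global_emb, alphabet, lookahead=3)
for ch, sc in zip(sentences[0], scores):
    print(repr(ch), round(sc, 2))
```

With nonnegative one-hot vectors every score lands between 0 and 1 already; with other choices of $\phi$ or $\psi$ the cosine similarity could in principle be negative.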

In my scripts, I also apply some final post-processing to the character scores $\text{score}(i, s)$ for each sentence. For one, I scale and translate the scores into the interval $[0,1]$ by subtracting the minimum score and scaling by the difference between the min and max scores. I also smooth the scores across each sentence by taking a windowed average, and apply a power function such as $x\mapsto x^4$ because it accentuates the difference between higher and lower scores. This is how we get the "heatmap" highlighted sentences and graphs showcased earlier. Extracting the words occurring at the peaks of these graphs is how relevant words are extracted from sentences.
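The post-processing can be sketched as one short Python function. The order of the three steps, the window width and the handling of the sentence edges are assumptions made for this example, and `postprocess` is a hypothetical name.

```python
def postprocess(scores, window=3, power=4):
    """Rescale scores to [0, 1], smooth with a windowed average, then sharpen."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # Min-max scaling; a constant sentence is mapped to all zeros.
    scaled = [(x - lo) / span if span else 0.0 for x in scores]
    # Windowed average; the window simply shrinks at the ends of the sentence.
    half = window // 2
    smoothed = []
    for i in range(len(scaled)):
        chunk = scaled[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    # The power function accentuates the gap between high and low scores.
    return [x ** power for x in smoothed]

raw = [0.2, 0.2, 0.9, 0.9, 0.9, 0.2, 0.2]
heat = postprocess(raw)
peak = max(range(len(heat)), key=heat.__getitem__)
print([round(x, 3) for x in heat], peak)
```

Candidate words would then be read off around positions like `peak`, where the processed score is highest.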

This technique still has several kinks that need to be worked out. For instance, in its current form, it does not distinguish subsequences that are common within a certain subset of sentences from subsequences that are common throughout the language as a whole. For that reason, the results of the above process often contain some irrelevant high-frequency strings corresponding to common words similar to the, a/an and I in English, for example. The same goes for the names Tom and Mary, which are extremely common in the Tatoeba corpus (to the point of being an inside joke of the Tatoeba community). Perhaps character scores could be modified by penalizing characters whose local embeddings are too similar to their global embedding in the language as a whole.

On a similar note, even if a certain word is not common in the language as a whole, it may co-occur very commonly with the target word. Consider for instance the words read/reads/reading and book. Naturally, they co-occur in a lot of the English sentences of the Tatoeba corpus, so that this technique might be likely to, say, mis-identify the Hungarian word for book as an appropriate translation of to read. I still haven't made up my mind about how to remedy this issue.

Finally, there is a key type of deduction that we use easily when manually inferring words' meanings, but that my vector method does not take advantage of. Let me illustrate it with another example. Consider the following parallel sentences in English and Latvian. From these sentences, can you guess a translation for the word milk?

English	Latvian
No, I never drink coffee with milk.	Nē, es nekad nedzeru kafiju ar pienu.
Boris never confronted Rima.	Boriss nekad nestājās pretī Rimai.
Don't drink alcohol.	Nedzeriet alkoholu.
I didn't drink any coffee today.	Es šodien nedzēru kafiju.
Do you actually like your coffee with salt?	Vai jums tiešām garšo kafija ar sāli?
No, I can't.	Nē, es nevaru.

You could probably infer that pienu means milk even though it only appears in one of these sentences. This is because the remaining words in that sentence also appear in at least one of the other sentences, but milk does not appear in any of their English translations. That is, we have applied a process of elimination to deduce a translation for the word milk, which is a heuristic that my code does not (yet) attempt to use.
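For what it's worth, here is a minimal sketch of how such a process of elimination might look, using the six sentence pairs above. The function names and the exact-match tokenization are assumptions for illustration; this is not part of the existing code.

```python
import re

PAIRS = [
    ("No, I never drink coffee with milk.", "Nē, es nekad nedzeru kafiju ar pienu."),
    ("Boris never confronted Rima.", "Boriss nekad nestājās pretī Rimai."),
    ("Don't drink alcohol.", "Nedzeriet alkoholu."),
    ("I didn't drink any coffee today.", "Es šodien nedzēru kafiju."),
    ("Do you actually like your coffee with salt?", "Vai jums tiešām garšo kafija ar sāli?"),
    ("No, I can't.", "Nē, es nevaru."),
]

def tokens(sentence):
    """Lowercased word tokens with punctuation stripped."""
    return re.findall(r"\w+", sentence.lower())

def candidates_by_elimination(pairs, target):
    """Words from translations of sentences containing `target` that never
    occur in translations of sentences lacking `target`."""
    with_target = [tgt for src, tgt in pairs if target in tokens(src)]
    without_target = [tgt for src, tgt in pairs if target not in tokens(src)]
    seen_elsewhere = {w for tgt in without_target for w in tokens(tgt)}
    result = []
    for tgt in with_target:
        for w in tokens(tgt):
            if w not in seen_elsewhere and w not in result:
                result.append(w)
    return result

print(candidates_by_elimination(PAIRS, "milk"))
```

With exact word matching this leaves two candidates, nedzeru and pienu, because the other sentences only contain the differently inflected forms nedzēru and nedzeriet. A human reader discounts nedzeru by recognizing those forms as related, so a practical version would need some fuzzy matching on top of this.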

To sum up, the things I'd still like to improve, in brief, are:

find a way of dealing with overrepresented named entities in the Tatoeba corpus, e.g. Tom and Mary
penalize common letter combinations throughout the language
come up with a way to filter out words that commonly co-occur with a target word
incorporate an add-on that also takes into account eliminative strategies

Here's a big fat table showing my algorithm's output for a few common words in several different languages, in case you would like to get a feel for how well it works and the kinds of errors it makes. I recommend Wiktionary for looking up the meanings of these words if you want to check their definitions for accuracy.

dog

cat

book

bread

water

milk

home

day

eat

sleep

read

black

white

big

small

ber
(Berber)

aydinni
uydinni
aydi
aydia
uydi
aydinneɣ
aydinnek
aydinnes
aydiinu
weydi

amcicnni
umcicnni
amcica
amcic
umcic
amuccnni
amcicinu
imucca
yimucca
imcac

adlisnni
udlisnni
idlisen
yidlisen
adlis
adlisa
yedlisen
adlisnnes
adlisinu
udlis

aɣrum
uɣrum
weɣrum
aɣrumnni
uɣṛum
aɣrumnnes
weɣrumnni
weɣruminu
aqbur
ara

waman
wamana
aman
watay
yeḥman
amanaya
amannni
amandin
mani
ameqqran

akeffay
akeffaya
ukeffay
akeffaynni
ukeffaynni
ukeffayis
akeffaynnek
ayefki
uyefki
yefkaiyid

ɣer
ɣef
deg
seg
yedda
yebda
yella
yelli
taddart
tamaneɣt

wass
ass
assa
wussan
ussan
ussana
asmi
assnni
assnsen
yessen

isett
nsett
ttetteɣ
ttetten
setteɣ
setten
teččed
teččeḍ
iḥemmel
ikemmel

teṭṭes
yeṭṭes
neṭṭes
yeṭṭsen
yeḍḍes
teṭṭseḍ
teṭṭsed
yettaṭṭas
yelzem
yiḍes

yeqqar
yeqqard
yeɣra
yeɣrad
yeɣri
adlis
adlisa
udlis
idlisen
yidlisen

aberkan
taberkant
iberkanen
tiberkanin
krayellan
tsednan
aberqemmuc
dakken
asgainna
ayisnnek

amellal
umellal
tamellalt
imellalen
yimellalen
tmellalt
mellul
tmellalin
timellalin
mellulet

tameqqrant
tameqrant
ameqran
ameqqran
ameqṛan
timeqqranin
timeqranin
imeqranen
meqqren
aḥeqqar

amecṭuḥ
tamecṭuḥt
mecṭuḥit
mecṭuḥet
imecṭuḥen
tameẓyant
tamurt
taḥanut
teɣlust
anect

ell
(Greek)

σκύλος
σκύλους
σκύλο
σκύλου
σκυλί
σκυλιά
δύσκολο
του
σου
σκότωσε

γάτα
γάτας
γάλα
γάτες
γάτος
είναι
είσαι
φοβάται
κοιμάται
τα

βιβλίο
βιβλίου
βιβλία
τίτλος
έβαλες
βάλε
το
του
ιστορικά
ανήκει

ψωμί
ψωμιού
σκορδόψωμο
μέρα
κάνω
αυτοί
τομ
μισό
έκοψε
είναι

νερό
νερά
νερού
άερα
πίνει
πίνεις
καλύτερο
είναι
έργα
δεν

γάλα
για
υγεία
σόγιας
λίγο
αλλεργικός

σπίτι
στις
πάτε
πόδια
είναι
τεράστιου
ποια
παιδιά
στο
σπό

μέρα
ημέρα
μέσα
μέρες
ημέρες
μέχρι
χώρα
σήμερα
μια
μία

τρώνε
τρώει
να
ένα
τρώω
τομ
τον
φάω
φάε
τα

κοιμάμαι
κοιμάται
κοιμάσαι
κοιμήθηκα
κοιμήθηκαν
κοιμήθηκε
κοιμήθηκες
κοιμόταν
κοιμούνται
κοιμηθεί

διαβάζει
διαβάζεις
διαβάσει
διαβάσεις
διαβάσω
διαβάζω
διάβασα
διάβασμα
διάβαζα
διάβασε

μαύρο
μαύρος
μαύρα
μαύρη
μαύρες
τομ
του
τον
το
αγοριού

άσπρο
άσπρος
άσπρα
άσπρη
άσπρους
εκείνα
είναι
εμφανίζεται
έναν
ένας

μεγάλο
μεγάλος
μεγάλοι
μεγάλα
μεγάλη
μεγάλε
μεγάλες
μέγαλος
μεγαλύτερη
μεγαλουπόλεις

μικρός
μικρό
μικρή
μικρά
μικρού
είναι
μεσαία
μένα
ένα
ενός

hun
(Hungarian)

kutyát
kutyád
kutyám
kutyák
kutyákat
kutyámat
kutyáját
kutyánkat
kutyája
kutyánk

macskákat
macskádat
macska
macskája
macskát
macskám
macskád
macskákért
macskával
macskánk

könyvet
könyveit
könyvét
könyvei
könyved
könyvek
könyve
könyveket
könyvedet
könyveim

kenyeret
kenyérhez
kenyérre
kenyérben
kenyerünk
bundáskenyeret
kenyér
kenyérből
kent
milyen

vizet
vizem
vized
vízen
vízben
vízzel
vízhez
vízre
vízbe
vizünk

tejet
tejed
tejjel
tehenet
tejből
tej
teheneket
vajat
fejni
sajt

otthon
itthon
otthonom
hazafele
hazafelé
otthonukról
haza
házat
tom
tomi

nap
napig
napok
napot
napon
napja
napom
napod
napokra
naponta

eszem
eszel
eszik
eszi
szeretnél
szeretnék
esznek
eszünk
vettem
ettem

aludni
elaludni
aludnom
alszik
alszok
aludj
aludt
aludjunk
aludtunk
aludtam

olvastad
olvastam
olvassam
olvasni
elolvasni
olvasod
olvasok
olvasom
olvasol
elolvastam

fekete
feketébe
feketék
feketében
felhőket
koromfekete
szeretem
nekem
feketepiacról
végezte

fehér
fehérre
fehérbe
falfehér
fehérnél
fehérbor
elfehéredik
megfehéredett
festette
fordult

nagy
vagy
nagyon
nagyok
mary
vagyok
egy
nagyvárosban
nagyvárosok
hogyan

kicsi
kocsim
kicsiben
kisvárosban
kisvárosból
kis
cicije
kisbicskát
szókincsed
kilátást

hye
(Armenian)

շունը։
շունը
շունդ
շունն
շուն
անունը
շանը
շանը։
շան։
ունի

կատուն
կատուն։
կատու։
կատուս
կատու
կատուները
կատուները։
կատուներ
կատուների
կատվին։

գիրքը։
գիրքը
գրքեր
գրքերը
գրքերն
գրքեր։
գիրքն
գիրք
գրել
գրքում։

հաց
հացը
հաց։
հացն
հացը։
գնեցի։
գնեց։
գնելիս։
առավ։
պատվիրեցի։

ջուր
ջուրը
ջուր։
ջուրը։
մաքուր
նոր
ունի
ջրով
խմում։
ու

կաթը
կաթ
կաթի
կաթ։
կաթը։
կատուն
խմել։
խմել
եմ
են

տուն
տուն։
տանն
տա՞նն
տանը
տանը։
տան
շուտ
տանել։
յաննին

երեկ
երեք
ամեն
մենք
մերին
երբեք
այն
տանն
համար
նրան։

ուտում։
ուտու՞մ։
ուտում
ուտո՞ւմ
ուտու՞մ
ուզում
ուտել։
ուտես։
ուտելու
ուտելու։

քնում։
քնում
քնել։
քնեք։
քնելը
քնեց։
քնո՞ւմ
քնել
քնեցի
քնեցի։

կարդացել
կարդացե՞լ
կարդում
կարդում։
կարդո՞ւմ
կարդացել։
կարդալ
կարդալ։
կարդա։
կարդաց

սև
սա
ես
այս
են։
եք։
ամեն
ամպերով։
մեքենան
նա

սպիտակ
սպիտակ։
պատերը
պատը
տունը։
սա
առյուծը
է։

մեծ
մե՞ծ
մեծ։
մենք
ամեն
է։
չէ։
են։
եմ
աչքեր

փոքր
փոքրիկ
բնակարանը
բառարանը
է։
էր։
որքա՞ն
մեր
երկիր
էր

ind
(Indonesian)

anjing
anjingku
anjingmu
anjingnya
ingin
jangan
anaknya
anggur
siang
makanan

kucing
kucingku
kucingmu
kucingnya
bukan
makan
temukan
ikan
menyukai
ini

buku
bukuku
bukumu
bukan
bukunya
suka
baru
bukubuku
aku
kesukaanmu

roti
rotinya
dari
tom
itu
turun
wanita
memberikan
air
mentega

airnya
air
dari
hari
ada
udara
mandi
mineral
sendiri
pantai

susu
susunya
sudah
sebelum
nasi
sapi
dua
setiap
dari
di

rumah
kerumah
rumahmu
rumahku
rumahnya
sebuah
hujan
bukan
apakah
pulang

hari
sehari
harimu
hasil
harga
nasi
hampir
seharian
harihari
kemarin

makan
akan
makanan
memakan
dimakan
maukah
malam
ikan
mana
kacang

tidur
tertidur
tidurlah
tidak
ribut
yaitu
menidurkanku
badak
dua
itu

membaca
membacakan
dibaca
beberapa
baca
dibacanya
majalah
sebuah
bukunya
padaku

hitam
kita
wanita
minum
tanpa
itu
melihat
pakaian
tikus
putih

putih
seputih
batubatu
hitam
salju
itu
ini

besar
sebesar
gambar
sejajar
semua
seluas
osaka
sebuah
sebelum
terkadang

kecil
memiliki
sempit
lakilaki
mencarikan
terlalu
kita
tetapi
tinggal
ini

isl
(Icelandic)

hundinn
hundurinn
hundinum
hundanna
hundarnir
hundur
kötturinn
hundasýningu
eigandinn
hundar

kötturinn
köttinn
köttur
hundurinn
maðurinn
kettir
ketti
kattar
kettinum
kött

bókina
bókin
bókinni
bók
bóka
bókarinnar
bækurnar
bækur
tekur
kemur

brauð
brauðbita
borðarðu
borðaði
borða
að
með
er

vatn
vatns
vatni
vatnið
vatninu
vatnsglas
kranavatn
vertu
flöskunni
fötunni

mjólk
mjólkar

heima
heim
heiman
heimilið
eins
heimabæinn
til
mig
minnir
er

daginn
dagurinn
dagsins
dag
daga
dagana
enginn
degi
segir
lengi

borðar
borða
borðað
borðaði
borðum
borðarðu
orðin
borðaðirðu
brauð
að

sofa
sofið
svefni
svefns
sofandi
sofnaði
svefn
svafst
svaf
hafa

lesa
lesið
þessa
lestu
skáldsöguna
skáldsögu
elska
lestur
enska
þessar

svartir
svartur
svört
svart
svörtu
svörtum
kolsvart
svartklædd
var
stór

hvítar
hvíta
hvítt
hvít
hvítur
hvítklædda
hvað
þetta
hvítvínsglas
eða

stórt
stór
stóra
stóri
er
ert
eru
stóran
stórir
en

lítill
lítil
lítið
litlum
litlir
litla
hluti
með lítið
bill
leit

kan
(Kannada)

ಪ್ರಾಣಿಗಳ
ಪ್ರಾಣಿಗಳು
ಇಲ್ಲಿವೆ
ಇಲ್ಲಿಗೆ
ಮಾತ್ರವಲ್ಲದೆ
ಮಾತ್ರವಲ್ಲದೇ
ಇಲ್ಲಿ
ಇಲ್ಲಿನ
ಪ್ರಾಣಿಗಳಾದ
ಕತ್ತೆ

ಚಿರತೆಗಳು
ಚಿರತೆಗಳ
ಕಾಡು
ಕಂಡು
ಬೆಕ್ಕು
ಬೆಕ್ಕಿನ
ಶ್ರೇಣಿಗಳನ್ನು
ಪ್ರಾಣಿಗಳನ್ನು
ಕಾಣಬಹುದು
ಕಾಣಬಹದು

ಪುಸ್ತಕಗಳು
ಪುಸ್ತಕಗಳ
ಪುಸ್ತಕಗಳನ್ನು
ಪುಸ್ತಕವನ್ನು
ಪ್ರವಾಸಿಗರು
ಪ್ರವಾಸಿಗರಿಗೆ
ಮತ್ತು
ವಸ್ತು
ಎತ್ತರ
ಪುಸ್ತಕಗಳಿವೆ

ಹಾಗು
ಪ್ರಶಾಂತ

ಮತ್ತು
ಮುತ್ತು
ಮತ್ತೆ
ಹೊತ್ತು
ಪ್ರವಾಸಿಗರ
ಪ್ರವಾಸಿಗರು
ಮತ್ತೊಂದು
ಮರೆತು
ಸುತ್ತಲು
ನೀರಿನ

ಮಾಡಿಸಲಾಗುತ್ತದೆ
ಮಾಡಲಾಗುತ್ತದೆ
ನೀಡಲಾಗುತ್ತದೆ
ನಂಬಲಾಗುತ್ತದೆ
ಪೂಜಿಸಲಾಗುತ್ತದೆ
ಹಾಲನ್ನು
ಹೆಸರನ್ನು
ಮತ್ತು
ಭಕ್ತರು
ಮಾತ್ರ

ಅಳಿವನಂಚಿನಲ್ಲಿರುವ
ಅಳಿವಿನಂಚಿನಲ್ಲಿರುವ
ಮತ್ತು
ಮುತ್ತ
ಪಕ್ಷಿಗಳಿವೆ
ಪಕ್ಷಿಗಳಿಗೆ
ಕತ್ತೆ
ಪ್ರಾಣಿಗಳ
ಪ್ರಾಣಿಗಳು
ಮನೆಯಾಗಿದೆ

ದಿನಗಳಲ್ಲೂ
ದಿನಗಳಲ್ಲಿ
ಬೆಳಗ್ಗೆ
ಬೆಳಿಗ್ಗೆ
ಆಚರಿಸಲಾಗುತ್ತದೆ
ನೆರವೇರಿಸಲಾಗುತ್ತದೆ
ತೆರೆದಿರುತ್ತಿದ್ದು
ತೆರೆದಿರುತ್ತದೆ
ಮತ್ತು
ಮತ್ತೆ

ಸೇವಿಸುತ್ತಾರೆ
ಸಲ್ಲಿಸುತ್ತಾರೆ
ಆಹಾರಗಳನ್ನು
ಆಹಾರವನ್ನು
ಪ್ರವಾಸಿಗರಿಗೆ
ಪ್ರವಾಸಿಗರೂ
ಕೊಲ್ಲುತ್ತಾರೆ
ಮಾಡಬಹುದು
ಮಾಡುವುದು
ತಿನ್ನುತ್ತಾರೆ

ಇಲ್ಲಿ
ರಲ್ಲಿ
ಇಲ್ಲಿಗೆ
ಮಲಗಿರುವ
ಮಲಗಿರುವಂತಹ
ಒದಗಿಸುತ್ತದೆ
ತೋರಿಸುತ್ತದೆ
ಮಲಗುವ
ಎಲ್ಲಾ
ಎಲ್ಲರ

ಸೂರ್ಯಾಸ್ತಮಾನವನ್ನು
ಸೂರ್ಯಸ್ನಾನವನ್ನು
ಇಲ್ಲಿನ
ಇಲ್ಲಿಯ
ಇಲ್ಲಿ
ಸ್ವಾಗತವನ್ನು
ಕ್ರಾಂತಿಯನ್ನೇ
ನಲ್ಲಿ
ಶಾಸನವೊಂದನ್ನು
ಹೆಸರುಗಳನ್ನು

ಕಪ್ಪು
ಕೆಂಪು
ಕಟ್ಟು
ಕಪ್ಪುಕರಡಿ
ಇಲ್ಲಿ
ಇಲ್ಲಿನ
ಮತ್ತು
ಮತ್ತೊಂದು
ರಫ್ತು
ಬೆಕ್ಕು

ಬಿಳಿ
ಬಿಳಿಯ
ನಿರ್ಮಿಸಲಾಗಿದೆ
ನಿರ್ಮಿಸಲಾಗಿರುವ
ಮತ್ತು
ಮತ್ತೊಂದು
ವಸ್ತು
ನಿರ್ಮಿಸಲ್ಪಟ್ಟಿದೆ
ಪ್ರವಾಸಿಗರು
ಪ್ರವಾಸಿಗರನ್ನು

ದೊಡ್ಡ
ದೊಡ್ದ
ಪ್ರವಾಸಿಗರ
ಪ್ರವಾಸಿಗರು
ಇಲ್ಲಿ
ಇಲ್ಲಿನ
ಇಲ್ಲಿದೆ
ಇಲ್ಲಿಗೆ
ನಲ್ಲಿ
ದೊಡ್ಡದಾದ

ಸ್ಥಳದಲ್ಲಿರುವ
ಸಮೀಪದಲ್ಲಿರುವ
ಇಲ್ಲಿವೆ
ಇಲ್ಲಿಗೆ
ಅಲೀಗಢದಲ್ಲಿರುವ
ದೂರದಲ್ಲಿರುವ
ಪ್ರವಾಸಿಗರು
ಪ್ರವಾಸಿ
ರಸ್ತೆಯಲ್ಲಿರುವ
ಜಿಲ್ಲೆಯಲ್ಲಿರುವ

kat
(Georgian)

ძაღლია
ძაღლი
ძაღლის
ძაღლს
ძაღლები
ძალიან
ძაღლთან
აი
არ

კატა
კატები
არის

წიგნი
წიგნის
წიგნია
წიგნში
წიგნს
წიგნებია
წიგნები
წიგნების
წიგნმა
ისინი

პური
პურს
ვჭამ
ჭამს
მაქვს
ვიყიდე

წყალს
წყალი
წყლის

რძე
რძეს
რძისგან
სახლისკენ
მე

სახლში
სახლშია
სახლი
სახლიდან
სახლისკენ
ახლა
ლეილას
დარჩით
ისინი
დაბნელებამდე

დღე
დღეს
დღეა
დღეში
დღის
ყოველდღე
რამდენ
მე
ბარდება
ეს

ჭამს
ჭამას
ვჭამ
ვჭამთ
ჭამა
გიჭამია
გვიჭამია
მიირთვა
მიირთვი
დესერტი

მძინავს
სძინავს
გძინავს
დაეძინა
დაიძინა
ძინავთ
გვეძინა
ეძინათ
მეძინა
დასაძინებლად

კითხულობენ
კითხულობს
ვკითხულობ
წაიკითხა
წავიკითხავ
კითხვა

ძაღლი
შავია

არის
თეთრი

დიდი
სახლი
ის

პატარა
მდინარის
მახლობლად
სახლში
ტომი

lit
(Lithuanian)

šunis
šunys
šuns
šunį
šunų
šunims
šuo
šuniui
nusipirkau
nusipirkti

katės
katė
katę
kates
katinai
katinas
katei
kėdės
katiną
kam

knygą
knyga
knygų
knygas
knygos
knygoje
mokiniai
naudinga
laikai
yra

duoną
duona
duonos
duok
parduoda
kurią
nori
nuo
pikto
žinau

vandens
vandenį
vandeniu
vanduo
vienas
sunkesnis
sūresnis
daviau
kareiviai
negalėtume

pieno
pieną
pienu
pienas
geria
geriu
neduoda
išgerti
nori
palaukti

namo
namų
namie
namai
mano
esame
neeiname
neturite
mane
taip

dienų
dieną
diena
dienas
viena
dienos
dienoms
dienom
dirba
kasdien

valgyti
valgti
pavalgyti
valgėte
valgei
valgom
valgo
suvaglyti
nevalgo
nevalgė

miegoti
pamiegoti
miegojai
miegojo
miega
miego
miegu
miegi
miegojau
miegantį

skaityti
perskaityti
skaitai
skaityk
perskaitysi
perskatyti
skaitoma
neskaityk
skaitau
perskaitysiu

juodas
juoda
juodai
juodų
juodo
juodą
juodus
juodos
juokiasi
lova

balta
baltas
baltą
balto
matau
pabalo

didelis
didelias
didelių
dideli
didelė
didelį
viena

mažas
mažos
maža
mažame
matai
namas
maži
mažą
mažoje
labai

lvs
(Latvian)

suni
suns
sunim
suņi
mans
suņu
mani
manu
suņiem
sāka

kaķis
kaķus
kaķim
kaķi
kaķa
mazākais
raibais
tikai
kaklu
vairāk

grāmata
grāmatu
grāmatas
grāmatām
smaga
man
tava
ir
tā

maizi
maizes
rupjmaizi
kvass
esi
kas
ar
ir

ūdens
ūdeni
ūdenī
ūdenim
ūdenstilpē
minerālūdens
iedevu
gruntsūdeņus
putni
nedzeru

pienu
piena
piens
rūgušpienu
priekšroku
reta
nedzer
pazīstama
un
ir

mājās
mājām
atstājis
atstāja
nerunājam
aizmirsa
sekoja
savu
angliski
ej

dienu
diena
dienā
vienu
dienas
dienai
ēdiena
dienās
viena
dienām

ēstu
ēst
ēd
ēdu
mēs
ēdīs
ēdīsi
ēdis
neēdu
ēdīšu

gulēju
gulēšu
gulēja
gulēji
gulēt
pagulēt
gulēsi
guļot
guļ
guļu

lasīt
lasītu
lasot
grāmatas
grāmata
lasīja
lasīju
izlasītu
lasījis
lasu

melns
melnas
melnos
melnās
melna
melnā
melnu
melnai
melnie
melnajā

balts
baltu
baltā
balto
baltais
bars
melnas
bikses
tas
straumi

liela
lielu
lieli
lielā
liels
lielās
lielām
saule
saules
tai

mazas
maza
mazu
mazs
mazā
mana
manam
maziem
marija
redzama

mkd
(Macedonian)

кучето
кучево
кучиња
кучињата
куче
кучка
куки
кое
очекуваше
чуваше

мачкава
мачката
мачките
мачка
мачки
сака
сакам
кучиња
имам
таа

книгата
книгава
книги
книга
книгите
читаш
дека
магии
премногу
на

леб
леп
лебот
треба
ли
е
со
во
од
на

водата
водава
вода
додај
додека
воденица
навадам
доволна
создаде
да

млеко
млекото
млеково
мене
козјо
пиеме
смееме
колку
ако
може

дома
дом
мама
додека
том
има
домашните
одам
одиме
да

ден
дена
еден
денес
денов
денот
дедо
арен
две
дневно

јадам
јадат
јадеш
јаде
јадел
јадеме
јадеше
јади
јадење
јадено

спијат
спијам
спие
спиев
спиеш
спиел
спиеле
спиење
спиеше
спиј

прочитам
прочита
прочиташ
прочитал
прочитав
читам
чита
читаш
читал
читаше

црни
црно
црна
црн
црниот
том
тоа
црнец
црната
црнокос

црнобели
црнобело
бели
белци
бела
бело
бел
белата
белиот
врело

голема
големо
големи
голем
том
тоа
поента
помогна
главното
главната

мала
мали
мало
мал
малата
табла
премала
премали
премало
малечка

nob
(Norwegian Bokmål)

hunden
hundene
hunder
under
hund
hun
rundt
nesten
sovende
hans

katten
kattene
katter
katt
klatre
hater
etter
svarte
kanskje
elsker

bøker
bøger
bøkene
boken
bokas
boka
bok
noen
ønsker
denne

brød
brødet
dere
denne
drar
ludder
allerede
de
er
egentlig

vann
vanne
vannet
mannen
vanndamp
enn
var
renne
varer
plantene

melk
melken
melkekyr
melkeallergi
melkeproduksjon
drikker
i

hjem
hjemme
jeg
hele
komme
kommer
hvis
rette
der
deg

dagen
dager
dag
deg
ganger
klager
dagboken
leger
lang
gang

spiser
spise
spiste
spises
spist
spis
pisa
disse
spisesalen
pizza

sovet
sover
sove
sovende
hver
sov
ideer
søvn
ligger
som

lese
leser
leste
lest
eller
hele
eventyr
allerede
denne
disse

svart
svarte
sort
var
hvit
hatt
har
katten
hesten
en

hvit
hvite
hvitt
har
var
hest
katten
kanter
svart
vi

stort
stor
store
svart
som
etter
sett
et
er
en

liten
lite
litt
lille
gutten
den
en
enn
kvinnen
enden

ron
(Romanian)

câinele
câinelui
câine
câini
caine
câinilor
cine
câinii
inventat
nevoie

pisica
pisică
pisici
pisicii
pisicile
pipăit
scăpat
trecea
petrece
mănâncă

carte
cartea
aceasta
această
care
cărți
cărții
cărui
foarte
cărțile

pâinea
pâine
taie
pe
proaspătă
în
ai

apă
apa
apei
apus
proaspăt
piatră
puțină
luxoasă
pe
era

laptele
lapte
poate
alerga
turnat
ea
el
a

acasă
casă
casa
școală
astăzi
șase
acum
tatăl
meargă
rămas

zi
azi
duminică
duminica
zile
zilele
săptămâna
săptămânii
ai
fi

mănânci
mănânc
mănânce
mănâncă
mănâncăți
mânca
mâncat
mâncați
mâncăm
mâncare

doarme
doarmă
dormi
dormit
dormind
dormea
dorm
adorm
dormeau
dormeam

citit
citite
citito
citești
citește
citim
citesc
citească
citi
cartea

negru
negrul
negre
negri
neagră
afară
mereu
fiecare
erau
grup

alb
albă
albe
albi
ale
sau
astăzi
lebedele
umple
ca

mare
mari
are
marile
țară
tale
foarte
mărire
favoare
gaura

mică
mici
mic
este
ești
asta
există
acesta
acest
camera

The posts on this website are licensed under CC-by-NC 4.0.

Interesting peculiarities of Russian grammar

2025-09-01

Interesting peculiarities of Russian grammar

Almost nine months ago I started teaching myself Russian. Initially I only planned to learn the Russian alphabet and maybe a few words and names. But once I started, it never stopped being interesting, so I haven't stopped yet! Recently I reached chapter 22 of the book The New Penguin Russian Course.

In this blog post I want to list and discuss a few particular peculiarities of the Russian language (especially in comparison with English, Spanish and German) that struck me as the most interesting.

Zero copula

In grammar, a copula is defined more or less as a word (usually a verb) that identifies the subject of a sentence with another noun, restricts it to a category, and so on. (The expression with which the subject is identified is called the "subject complement".) For example, the verb "sein" is the main German copula, and likewise the verb "to be" in English.

Russian, however, uses a zero copula: the subject of a sentence can be linked to another noun or adjective without any additional word. Here are a few examples:

Russian sentence	English translation
Я не англичанин.	I am not an Englishman.
Он танцовщик.	He is a dancer.
Том — мормон.	Tom is a Mormon.

Sometimes an em dash is written in place of the copula, as in the third example above. It is not that Russian has no copular verb; it is simply left out in the present tense. The copular verb is быть, and it does show up in its infinitive and in its past-tense forms. It also has forms conjugated like a present tense, but these carry future meaning. In most cases where быть appears, the subject complement has to be declined in the instrumental case.

Russian sentence	English translation
Хочу быть инженером.	I want to be an engineer.
Он был ленивым.	He was lazy.
Она будет учительницей.	She will be a teacher.

Relatively complicated cardinal number system

"Relatively complicated" is a huge understatement: I would never have guessed that cardinal numbers are one of the most complicated aspects of Russian grammar. For those unfamiliar with the term "cardinal number": it refers to numerals that specify a quantity of something. (In English, for example: "one apple", "two apples", "three apples" and so on.) This contrasts with ordinal numbers, which specify the position of an item in a sequence. (In English: "first chapter", "second chapter", "third chapter" and so on.)

The cardinal numerals themselves are not that complicated, or at least no more complicated than in other European languages. Here are some examples:

Russian numeral	Number
один	1
два	2
три	3
четыре	4
пять	5
шесть	6
семь	7
восемь	8
девять	9
десять	10
одиннадцать	11
двенадцать	12
тринадцать	13
четырнадцать	14
пятнадцать	15
шестнадцать	16
семнадцать	17
восемнадцать	18
девятнадцать	19
двадцать	20
двадцать один	21
двадцать два	22
двадцать три	23

What is more complicated is the inflection of the noun for the things being counted. For the number 1 and any number whose last digit is 1 (apart from 11), the noun is put in the nominative singular. For the numbers 2, 3 and 4 and any number whose last digit is one of these (apart from 12, 13 and 14), the noun is also singular, but in the genitive case. Finally, for all other numbers the noun is put in the genitive plural. So, for example:

Russian numeral	Number	Case	Number of apples
один	1	nom. sg.	одно яблоко
два	2	gen. sg.	два яблока
три	3	gen. sg.	три яблока
четыре	4	gen. sg.	четыре яблока
пять	5	gen. pl.	пять яблок
шесть	6	gen. pl.	шесть яблок
семь	7	gen. pl.	семь яблок
восемь	8	gen. pl.	восемь яблок
девять	9	gen. pl.	девять яблок
десять	10	gen. pl.	десять яблок
одиннадцать	11	gen. pl.	одиннадцать яблок
двенадцать	12	gen. pl.	двенадцать яблок
тринадцать	13	gen. pl.	тринадцать яблок
четырнадцать	14	gen. pl.	четырнадцать яблок
пятнадцать	15	gen. pl.	пятнадцать яблок
шестнадцать	16	gen. pl.	шестнадцать яблок
семнадцать	17	gen. pl.	семнадцать яблок
восемнадцать	18	gen. pl.	восемнадцать яблок
девятнадцать	19	gen. pl.	девятнадцать яблок
двадцать	20	gen. pl.	двадцать яблок
двадцать один	21	nom. sg.	двадцать одно яблоко
двадцать два	22	gen. sg.	двадцать два яблока
двадцать три	23	gen. sg.	двадцать три яблока
двадцать четыре	24	gen. sg.	двадцать четыре яблока
двадцать пять	25	gen. pl.	двадцать пять яблок
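The agreement rule can be written as a short function. This is only a sketch: the names are invented for illustration, and it assumes (going slightly beyond the table) that the exceptions 11 to 14 are detected from the last two digits, so that 111 to 114 also take the genitive plural.

```python
def noun_form(n):
    """Case and number of the counted noun after the cardinal n
    (for noun phrases in the nominative case)."""
    last_two = n % 100
    last = n % 10
    if 11 <= last_two <= 14:
        return "gen. pl."   # 11, 12, 13, 14 are exceptions
    if last == 1:
        return "nom. sg."   # 1, 21, 31, ...
    if 2 <= last <= 4:
        return "gen. sg."   # 2-4, 22-24, ...
    return "gen. pl."       # everything else

# The three forms of "apple" used in the table above.
APPLE = {"nom. sg.": "яблоко", "gen. sg.": "яблока", "gen. pl.": "яблок"}

for n in (1, 4, 5, 11, 21, 22, 25):
    print(n, APPLE[noun_form(n)])
```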

All of this applies only to cardinal numbers that form a noun phrase in the nominative or accusative case. If the noun phrase is in one of the other four cases, then both the cardinal number and the noun are declined in that case. Naturally, the numerals themselves also have plenty of irregular case declensions:

Number	nom.	gen.	dat.	inst.	prep.
2	два	двух	двум	двумя	двух
3	три	трёх	трём	тремя	трёх
4	четыре	четырёх	четырём	четырьмя	четырёх
5	пять	пяти	пяти	пятью	пяти
6	шесть	шести	шести	шестью	шести
40	сорок	сорока	сорока	сорока	сорока
50	пятьдесят	пятидесяти	пятидесяти	пятьюдесятью	пятидесяти
90	девяносто	девяноста	девяноста	девяноста	девяноста
100	сто	ста	ста	ста	ста
200	двести	двухсот	двумстам	двумястами	двухстах
300	триста	трёхсот	трёмстам	тремястами	трёхстах

For something as simple and basic as counting objects, that is ridiculously complicated! In German, as in English and Spanish, you mostly just have to put a cardinal numeral and a noun next to each other.

Perfective and imperfective verbs

Many languages draw a grammatical distinction between events that are completed within a particular time frame and events that are not completed. This distinction is called (linguistic) aspect. In the first case the aspect is called perfective, and in the second case imperfective. Here is a table of examples in German:

German sentence	English sentence	Aspect of the italicized verb	Reason
Ich trank ein Glas Milch.	I drank a glass of milk.	perfective	a specific quantity of milk is drunk up
Sami trank immer Milch.	Sami always drank milk.	imperfective	habitual action that is repeated
Kinder trinken Milch.	Kids drink milk.	imperfective	general action rather than a specific event
Ich werde ein Bier trinken.	I will drink a beer.	perfective	a specific quantity will be drunk up
Trink deine Milch.	Drink up your milk.	perfective	a command that the milk be drunk up completely
Du musst viel Milch trinken; dann wirst du groß und stark.	You must drink lots of milk; then you'll get big and strong.	imperfective	general recommendation for the future

In the languages I know so far (English, Spanish and German), perfective and imperfective uses of a verb are distinguished by different inflections or by additional words (e.g. always, every day, often, sometimes and so on). In Russian, by contrast, every verb is either perfective or imperfective; that is, the aspect depends mainly on the verb itself. To every imperfective verb there corresponds one (but sometimes several) perfective verb. The aspect sometimes even affects the meaning of a verb. The table below gives some examples:

imperfective verb	meaning	perfective verb	meaning
говорить	to talk/to say	сказать	to say (something)
говорить	to talk/to say	поговорить	to talk for a while
знать	to know	узнать	to find out/to learn
пить	to drink	выпить	to drink up
жить	to live	пожить	to live (somewhere) for a while
звать	to call	позвать	to call over/to summon
учить	to study	выучить	to master (a subject/a language)

The pair садиться/сесть "to sit down" is also particularly interesting, because the imperfective counterpart is a reflexive verb (as shown by the suffix -ся) while the perfective counterpart is not a reflexive verb!

Thanks to this division of verbs, a variety of temporal meanings can be expressed with only a few tense forms. For example, the present-tense form of Russian verbs plays a double role: with imperfective verbs it expresses an action in progress or a present state, but with perfective verbs it denotes a future action. So a single verb form can serve two tenses (present and future)!

In any case, this is probably the feature of Russian grammar that changes my way of thinking the most when I try to say or write something in Russian.

Variety of verbs of motion

The topic in Russian grammar that has proved the hardest for me so far is the verbs of motion (go, drive, run, visit, bring and so on). There are a great many verbs of motion to remember in Russian, and to make matters worse, some pairs of them look very similar, which makes memorizing them harder for me.

In general, Russian often makes it impossible to leave out details about a movement that you are used to not having to specify in your native language. For example, the German verb gehen could describe a walk as well as a car ride, and by itself it does not say whether the trip is one-way or a round trip. Russian has a specific verb for each case, but no general verb that hides these details.

Russian verb of motion	Meaning
ходить	to go, on foot, there and back or in many directions
идти	to go, on foot, in one direction
ездить	to go, by car, there and back or in many directions
ехать	to go, by car, in one direction
водить	to bring (someone), on foot, there and back
вести	to bring (someone), on foot, in one direction
возить	to bring (someone/something), by car, there and back
везти	to bring (someone/something), by car, in one direction

What is more, these are only the imperfective verbs. Each of them also has a perfective counterpart (sometimes two or more!). Since the choice of the right verb depends on so many parameters, it is hard even to design a good table of them; you would probably need a whole hypercube.
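One way to picture that hypercube is as a lookup table keyed by one value per parameter. The sketch below covers only the eight imperfective verbs from the table above; the key labels and the function name are made up for illustration, and a perfective axis would add yet another dimension.

```python
# Keys: (mode of travel, kind of motion, direction) -> imperfective verb.
VERBS_OF_MOTION = {
    ("on foot", "go", "one direction"): "идти",
    ("on foot", "go", "there and back"): "ходить",
    ("by car", "go", "one direction"): "ехать",
    ("by car", "go", "there and back"): "ездить",
    ("on foot", "bring", "one direction"): "вести",
    ("on foot", "bring", "there and back"): "водить",
    ("by car", "bring", "one direction"): "везти",
    ("by car", "bring", "there and back"): "возить",
}

def pick_verb(mode, kind, direction):
    """Look up the imperfective verb for one corner of the cube."""
    return VERBS_OF_MOTION[(mode, kind, direction)]

print(pick_verb("by car", "go", "there and back"))
```

Each of the three parameters takes two values, giving the 2 × 2 × 2 = 8 corners of a cube.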

Similarities with English, Spanish and German

To wrap up, I want to summarize in a couple of lists what I have found hardest and what I have found easiest so far in learning Russian.

The easiest aspects of Russian grammar for me so far (or the aspects that were easier than I expected) are:

not having to remember or decline any articles
the past-tense form of the verb
forming the conditional and the irrealis

and the hardest aspects:

declining plural nouns in the genitive case
cardinal numbers (partly because they rely on genitive plural declensions)
choosing between the prepositions в and на depending on the place
remembering the position of the stress in a word
understanding and using colloquial Russian particles (e.g. и, а, ну, да, -ка, -то)

In general, it is also sometimes rather hard not to mix up the many similar-looking words. There is a meme about that.

Since this blog post is already so packed with tables, why not one more? I have put together some language features in the following table to compare English, Spanish, German and Russian with one another:

Language feature	English	Spanish	German	Russian
pro-drop		✓		✓
2nd person sing. becomes 2nd person pl. in formal speech		✓	✓	✓
has articles	✓	✓	✓
grammatical gender for nouns		✓	✓	✓
double negation is negation		✓		✓
has a case system			✓	✓
verb inflections depend on gender				✓
has grammatical animacy		✓		✓
many meaningful verb prefixes			✓	✓
perfect can be expressed with an auxiliary verb	✓	✓	✓
future can be expressed with an auxiliary verb	✓	✓	✓	✓


A concise explanation of the cosine formula for the dot product

2025-08-15

A concise explanation of the cosine formula for the dot product

The dot product is an elementary mathematical operation that assigns a scalar to a pair of vectors. I think I first learned about it explicitly when I took the course called "Calculus 3" at my university. It is very easy to compute: the dot product is the sum of the products of the corresponding coordinates of the vectors.

$\mathbf{x}\cdot\mathbf{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$

It is often used, however, to calculate the angle $\vartheta$ enclosed between the vectors, because it also holds that:

$\mathbf{x}\cdot\mathbf{y} = \lVert \mathbf{x}\rVert \cdot\lVert \mathbf{y}\rVert \cdot \cos\vartheta$

This is a very important formula that is used all over multivariable calculus, linear algebra, geometry and so on. But how do you actually prove this cosine formula if you take only the first formula as the definition of the dot product? It is considered elementary, but it is not at all obvious. Even so, not once has a professor or textbook explicitly explained to me the connection between these two formulas.

So I wanted to present a very concise line of reasoning that makes it possible to grasp this cosine formula "at a glance". It is nothing special, but I had never actually written it down clearly.

The cosine formula for the dot product is easy to prove if you are willing to accept the following "obvious" (or at least more intuitive) facts:

rotations in $\mathbb R^n$ preserve the angles between vectors
rotations in $\mathbb R^n$ preserve distances between pairs of points
for every pair of vectors $\mathbf{x},\mathbf{y}\in\mathbb R^n$ there is a rotation about the origin that brings them both into the XY-plane

This line of reasoning rests mainly on the basic algebraic fact $(x-y)^2 = x^2 - 2xy + y^2$. Put differently, this identity reads:

$x\cdot y = \frac{x^2 + y^2 - (x-y)^2}{2}$

Let $\mathbf{x},\mathbf{y}\in\mathbb R^n$ be arbitrary vectors. Applying this identity to the two vectors component by component and summing over the components gives the equation

$\mathbf{x}\cdot\mathbf{y} = \frac{\lVert \mathbf{x}\rVert^2 + \lVert\mathbf{y} \rVert^2 - \lVert \mathbf{x} - \mathbf{y} \rVert^2}{2}$
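Written out, the componentwise step looks like this (the index $i$ running over the coordinates is the only notation added here):

$\sum_{i=1}^n x_i y_i = \sum_{i=1}^n \frac{x_i^2 + y_i^2 - (x_i - y_i)^2}{2} = \frac{\sum_{i=1}^n x_i^2 + \sum_{i=1}^n y_i^2 - \sum_{i=1}^n (x_i - y_i)^2}{2}$

The three sums on the right are exactly $\lVert \mathbf{x}\rVert^2$, $\lVert\mathbf{y} \rVert^2$ and $\lVert \mathbf{x} - \mathbf{y} \rVert^2$.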

This means that $\mathbf{x}\cdot\mathbf{y}$ depends only on the three distances $\lVert \mathbf{x}\rVert, \lVert\mathbf{y} \rVert, \lVert \mathbf{x} - \mathbf{y} \rVert$. By our assumptions, every rotation must preserve the distances between points. So if $U$ is a rotation about the origin, it preserves the values of the expressions $\lVert \mathbf{x}\rVert, \lVert\mathbf{y} \rVert, \lVert \mathbf{x} - \mathbf{y} \rVert$. It follows that

$\mathbf{x}\cdot\mathbf{y} = U\mathbf{x}\cdot U\mathbf{y}$
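A quick numerical sanity check of both facts, as a sketch in plain Python. The example vectors, the rotation angle and the helper names are arbitrary choices for illustration; the rotation used is a rotation in one coordinate plane of $\mathbb R^3$.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def rotate(v, i, j, angle):
    """Rotate v by `angle` radians in the plane of coordinates i and j."""
    w = list(v)
    c, s = math.cos(angle), math.sin(angle)
    w[i] = c * v[i] - s * v[j]
    w[j] = s * v[i] + c * v[j]
    return w

x = [1.0, 2.0, 3.0]
y = [-2.0, 0.5, 4.0]

# Dot product recovered from the three distances alone.
diff = [a - b for a, b in zip(x, y)]
from_distances = (norm(x) ** 2 + norm(y) ** 2 - norm(diff) ** 2) / 2

# Apply the same rotation U to both vectors.
ux, uy = rotate(x, 0, 2, 0.7), rotate(y, 0, 2, 0.7)

print(dot(x, y), from_distances, dot(ux, uy))
```

All three printed values agree up to floating-point rounding.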

Our third assumption says that there is a particular rotation $U$ that brings $\mathbf{x},\mathbf{y}$ into the XY-plane simultaneously. So we have deduced that to every pair of vectors $\mathbf{x}, \mathbf{y}$ there corresponds another pair of vectors $\mathbf{x}' = U\mathbf{x}, \mathbf{y}' = U\mathbf{y}$ in the XY-plane that has both the same dot product and the same angle. That is: the formula holds in general as long as it holds in the two-dimensional case, where $\mathbf{x} = (x_1,x_2)$ and $\mathbf{y} = (y_1,y_2)$.

In the two-dimensional case, though, the formula is very easy to prove. By a further rotation about the origin, $\mathbf{y}$ can be aligned with the positive X-axis, so that $\mathbf{x} = (x_1,x_2)$ and $\mathbf{y} = (y_1, 0)$ with $y_1 > 0$. In this simplest case of all, the cosine formula reduces to $x_1 y_1 = \sqrt{x_1^2 + x_2^2}\cdot y_1 \cos\vartheta$, or $\frac{x_1}{\sqrt{x_1^2 + x_2^2}} = \cos\vartheta$. This equation is just plain two-dimensional trigonometry.

And that's it!

Incidentally, I wonder whether some connection can be established between the dot product and the Galois theory of rational function fields. The heart of the proof above was the fact that rotations about the origin preserve the dot product, that is, the polynomial $x_1 y_1 + \cdots + x_n y_n$. So I wonder, for example, whether the subfield of $\mathbb R(x_1,\cdots, x_n, y_1,\cdots, y_n)$ consisting of the rational functions preserved by simultaneous linear rotations is exactly $\mathbb R(\lVert \mathbf{x}\rVert^2, \lVert \mathbf{y}\rVert^2, \mathbf{x}\cdot\mathbf{y})$. So far I have neither confirmed nor refuted this.

The posts on this website are licensed under CC-by-NC 4.0.

A deceptively simple-looking minimax problem

2025-07-15

Lately I've been brushing up on my statistics (which is only prudent given all of the buzz about ML in the software dev world right now) and I've gone down a bit of a rabbit-hole studying parameter estimation problems. Lehmann's books Theory of Point Estimation and Testing Statistical Hypotheses present parameter estimation problems in a general framework that I've found pretty insightful. Actually, as a side project I've put together a little website with a collection of parameter estimation challenges that you can solve in the browser by writing a WebR function. Check it out!

These parameter estimation problems involve making estimates of unknown quantities given only imperfect information in the form of randomly distributed data. Penalties for incorrect answers are calculated in terms of a given "loss function". An interesting subclass of these problems are the minimax problems, where you are tasked with minimizing the expected loss in the worst case, that is, minimizing the maximum possible expected loss across all possible values of the unknown quantity.

Lehmann comments that minimax problems are often very tricky to solve compared to other forms of parameter estimation problems. Of course, I had to see this for myself to believe it, so I wrote down one of the simplest minimax problems I could think of and tried to solve it.

And boy, was he right. I've been toying with this problem on and off for the past several weeks, and it's been infuriating, particularly because the problem's statement is so simple on its face. But I finally finished solving it analytically just a few days ago, and the solution is far more complicated than it has any right to be.

Anyways, here's the problem:

There is an unknown parameter $\vartheta \in [0,1]$, and you need to make a guess $\vartheta^\ast$ at the value of this parameter. You are penalized based on how far off your guess is from the true value - if the true value is $\vartheta$ and your guess is $\vartheta^\ast$, then the penalty is $L(\vartheta,\vartheta^\ast) = (\vartheta - \vartheta^\ast)^2$, that is, the squared error. The only information you are given to inform your estimate is the value of a random variable $\omega\sim \mathcal U(0,\vartheta)$, that is, the value of a uniformly distributed random value in $[0,\vartheta]$.

How can you choose your estimate in order to minimize the maximum possible expected penalty? That is, what strategy will guarantee the expected penalty to be as small as possible, regardless of the true value of $\vartheta\in [0,1]$? And what is this smallest possible expected penalty?

The "strategies" for solving this problem can be represented as "decision functions" $\delta: [0,1]\to [0,1]$ such that $\delta(\omega) = \vartheta^\ast$ gives an estimate for the parameter $\vartheta$ in terms of the random observation $\omega\sim \mathcal U(0,\vartheta)$. You can take a crack at this problem yourself on my website by writing a decision function in R, if you want.

This post will describe the winding path that I followed to the ultimate grotesque solution of this minimax problem. Enjoy! 🤡

A failed attempt

In Lehmann's Theory of Point Estimation, he mentions a very useful fact about minimax problems: if you can find a prior distribution $\Lambda$ and a Bayes solution $\delta_\Lambda$ for that prior that makes the risk function a constant function, then $\delta_\Lambda$ is automatically a minimax solution for the problem. (The risk function $R(\delta, \vartheta)$ is defined as the expected loss when a specific decision function $\delta$ is used, and the true parameter value is $\vartheta$.) He shows how to use this fact to deduce the minimax estimate for an unknown parameter $p$ given a binomially distributed random variable $X\sim\text{Binom}(n,p)$. (He does this by letting $\Lambda$ be a certain beta distribution, but as for where this idea came from in the first place, he kind of pulls it out of a hat.)

So naturally, my first step was to look for a decision function $\delta$ making the risk function $R$ constant. Then I could try to find a prior distribution for which that decision function was Bayes optimal, and my work would be done. For the risk to be constant as a function of the parameter $\vartheta$, the following expression would have to be constant as a function of $\vartheta$: $R(\vartheta) = \frac{1}{\vartheta}\int_0^\vartheta \big(\vartheta - \delta(x)\big)^2 ~ dx = C$
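To get a feel for this risk integral, it can be evaluated numerically for any candidate decision function. The following is a minimal Python sketch, assuming a plain midpoint rule; the name `risk` and the sample decision functions are illustrative choices, not part of the problem statement.

```python
def risk(delta, theta, steps=10_000):
    """Approximate R(theta) = (1/theta) * integral_0^theta (theta - delta(x))^2 dx
    with the midpoint rule."""
    h = theta / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += (theta - delta(x)) ** 2
    return total * h / theta

def identity(x):
    # The naive strategy "guess the observation itself".
    return x

# For delta(x) = x the integral can be done by hand: R(theta) = theta^2 / 3.
for theta in (0.25, 0.5, 1.0):
    print(theta, risk(identity, theta), theta ** 2 / 3)
```

For the naive strategy $\delta(x) = x$ the risk is $\vartheta^2/3$, which is far from constant and reaches $1/3$ at $\vartheta = 1$.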

It took me a few weeks of on-and-off work on this problem to realize that $\delta$ can be solved for analytically. But in the meantime, I found an approximate solution for $\delta$ by discretizing the interval $[0,1]$ into a bunch of evenly spaced points and reformulating the problem as a system of equations that could be solved algorithmically. In fact, there are infinitely many solutions: if $\delta$ is a solution with constant risk $C$, then for any $\alpha > 0$ the function $x\mapsto \alpha \delta(x/\alpha)$, obtained by dilating the graph of $\delta$ about the origin, is another solution with constant risk $\alpha^2 C$. Up to this dilation, the numerical solution looks something like this:

This poses an unfortunate problem: there is no way to dilate/contract this function in such a way that it is defined on all of $[0,1]$ and is also $\leq 1$ everywhere on $[0,1]$. This means that $\delta$ cannot be the Bayes solution for any prior $\Lambda$, because it can never be optimal to guess a value of $\vartheta$ that is greater than $1$ (since $\vartheta$ only takes values in the interval $[0,1]$). This dashes any hopes of proving minimaxity via the aforementioned theorem on constant risk functions.

Of course, we could always try modifying this decision function so that it never returns an "unreasonable" estimate $\vartheta^\ast > 1$. For instance, we might consider a decision function $x\mapsto \min(1, \delta(x))$ where $\delta$ is the function depicted above that makes the risk function $R$ a constant function. But of course, this truncated version does not make $R$ a constant function. If we use $x\mapsto \min(1, \delta(x))$ as our decision function, then the new risk function looks like this:

The maximum risk here is approximately $0.0891$, which is not too bad! But this risk function is non-constant, so there is no guarantee of minimaxity. And as we shall see in a moment, it is not, in fact, optimal.

Dubious approximation using gradient descent

After (erroneously) deciding that it was unlikely I would ever analytically find a minimax solution to this problem, I started looking for approximate numerical solutions. Initially, I had used numerical methods to approximate a decision function $\delta$ making the risk function constant. But a more direct approach would be to numerically calculate a decision function $\delta$ minimizing the maximum value of the risk function $R(\vartheta)$ by using a numerical minimization method.

My numerical approach to this problem was as follows:

Discretize the domain $\Theta = [0,1]$ into $n$ points
Discretize some initial guess $\delta$ of the decision function as a vector in $\mathbb R^n$
Express $R$ as a vector function $r:\mathbb R^n\to \mathbb R^n$ whose input is $\delta$ and whose output is a vector discretizing the risk function $R(\vartheta)$
Apply gradient descent to the objective function $\lVert r(\delta) \rVert_\infty = \max_{1\leq i\leq n} |r_i(\delta)|$

There is a bit of a problem with this approach, though. Although the risk $r(\delta)$ is a differentiable function with respect to the different components of the $\delta$ vector, the supremum norm $\lVert\cdot\rVert_\infty$ is not a differentiable function of its vector argument, so gradient descent cannot really be applied (in its usual form) to the objective function $\delta\mapsto \lVert r(\delta) \rVert_\infty$.

Instead, I applied a modified form of gradient descent in which at each step, only the gradient of the largest component of $r(\delta)$ is calculated and used to adjust the input vector $\delta$. This way, each step of the gradient descent algorithm focuses on decreasing the largest component of the output risk vector $R = r(\delta)$, which is necessary to decrease the maximum component of the vector $R$.

This was just my heuristic approach to the problem, and it yielded a helpful insight, as we will see in a moment. But there was nothing rigorous about this idea. I've found no reference to this modified sup-norm version of gradient descent anywhere online, so I have no idea if there are any theoretical guarantees of its convergence. Also, many practical variants of gradient descent use the gradients at previous steps to dynamically adjust the step size, but because my modified algorithm is constantly switching between different components of $R$ to minimize, this kind of intelligent step size calculation wasn't possible. Instead, I just picked a "small enough" static step size to see what the method would turn up.
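The procedure described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the code behind the animation: the grid ($\vartheta_i = i/n$, with $\delta$ sampled at cell midpoints), the step size, the iteration count and all names are choices made for this sketch. Stepping along the gradient of the currently largest component is what the optimization literature calls a subgradient step for a maximum of smooth functions. Such steps need not decrease the objective every time, so the sketch keeps track of the best iterate seen.

```python
def risks(d):
    """Discretized risk r_i = (1/i) * sum_{j<=i} (theta_i - d_j)^2 with theta_i = i/n."""
    n = len(d)
    out = []
    s1 = s2 = 0.0  # running sums of d_j and d_j^2
    for i in range(1, n + 1):
        s1 += d[i - 1]
        s2 += d[i - 1] ** 2
        theta = i / n
        out.append(theta * theta - 2 * theta * s1 / i + s2 / i)
    return out

def max_component_descent(d, step=0.05, iterations=20_000):
    """Repeatedly step along the negative gradient of the largest risk component."""
    n = len(d)
    d = list(d)
    best_d, best = list(d), max(risks(d))
    for _ in range(iterations):
        r = risks(d)
        worst = max(range(n), key=r.__getitem__)  # index of the largest component
        if r[worst] < best:
            best, best_d = r[worst], list(d)
        i, theta = worst + 1, (worst + 1) / n
        for j in range(i):
            # d r_i / d d_j = -(2/i) * (theta_i - d_j) for j <= i, and 0 otherwise
            d[j] += step * (2 / i) * (theta - d[j])
    return best_d, best

n = 50
start = [(j + 0.5) / n for j in range(n)]  # initial guess delta(x) = x
best_delta, best_risk = max_component_descent(start)
print(max(risks(start)), best_risk)
```

The first printed number is the maximum risk of the initial guess (about $1/3$); the second is the smallest maximum risk encountered during the run.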

Here's an animation of my modified gradient descent algorithm being applied to an initial decision function guess of $\delta(x) = x$:

We can see the input decision function $\delta$ and the output risk function $R$ seemingly converge to functions shaped similarly to what we saw in my original failed attempt. To me, this suggested that my initial approach might not have actually been too far off. However, the maximum risk of this numerical solution was significantly lower at $\approx 0.0747$, as compared to a maximum risk of $\approx 0.0891$ in the original attempt.

The risk functions for the original attempt and this newer attempt look similar, in that they consist of a plateau stretching until about $x\approx 0.5$ followed by a parabolic-looking dip downwards. However, a notable difference between them is the fact that in the older solution, the end of the dip at $x=1$ falls short of the height of the original plateau, while in the newer solution, the end of the dip at $x=1$ seems to match the height of the original plateau. This qualitative observation led to the conjectured exact solution described in the next section (later proven to be correct).

A conjectured analytical solution

Previously, I mentioned the idea of trying to "salvage" a non-Bayes (and non-admissible) decision function $\delta$ making the risk function $R$ a constant function by capping its values at $1$. Then, when I used a modified version of gradient descent to search for a minimax solution, I noticed that the apparent numerical solution (and its risk function) looked an awful lot like the solution and non-constant risk function that I had salvaged from my original attempt, except that the height of the plateau in the risk function starting at $R(0)$ appeared to align with the final value $R(1)$ of the risk function.

This led to my next method of attack: trying to find a decision function which both makes $R$ constant, and also makes $R(0) = R(1)$ when it is "capped" at a maximum value of $1$. In what follows, I will let $f$ denote a non-admissible decision function that makes $R$ constant, and let $\delta$ denote the decision function that results from capping it off at a maximum value of $1$.

Around this time is when I figured out how to analytically solve for functions $f$ reducing the risk function $R$ to a constant, by solving a certain differential equation. And the answer is weird.

We're looking for functions $f$ satisfying the following integral identity, for some constant $C$: $R(\vartheta) = \frac{1}{\vartheta}\int_0^\vartheta \big(\vartheta - f(x)\big)^2 ~ dx = C$ Note that this is the same as saying $\frac{d}{d\vartheta}\int_0^\vartheta \big(\vartheta - f(x)\big)^2 ~ dx = C$ Using the Leibniz integral rule we can turn this into a differential equation for the function $f$: we obtain $\big(\vartheta - f(\vartheta)\big)^2 + 2\vartheta^2 - 2\int_0^\vartheta f(x) ~ dx = C$ From this point, the calculations are a bit nicer if we make a substitution, considering the function $g$ defined as $g(\vartheta) = f(\vartheta) - \vartheta$ in place of the function $f$. With this substitution, the above becomes $g^2 + \vartheta^2 - 2\int_0^\vartheta g(x) ~ dx = C$ (setting $\vartheta = 0$ shows that $C = g(0)^2 = f(0)^2$). Differentiating once more with respect to $\vartheta$ yields: $2g'g + 2\vartheta - 2g = 0$ or, after simplifying, $g' = 1 - \frac{\vartheta}{g}$

The solution to this differential equation is quite messy and involves defining $g$ implicitly. (To be completely honest, I found the solution at first using Wolfram, but came up with the following slick derivation after the fact.) To solve it, we'll first define a complex-valued function $z:(0,1)\to \mathbb C$ as $z(\vartheta) = \zeta_3\vartheta + g(\vartheta)$, where $\zeta_3 = e^{2\pi i/3}$ is the complex cube root of unity in the upper half-plane. Write $\zeta_6 = e^{\pi i/3} = 1 + \zeta_3$, so that $\zeta_6\zeta_3 = -1$. Then note that by the differential equation for $g$, we have $\frac{dz}{d\vartheta} = \zeta_3 + g' = \frac{\zeta_6}{g}\cdot z$ which means that the following quantity has to be purely real, since $g$ is a real-valued function: $\zeta_6^{-1}\cdot \frac{1}{z}\frac{dz}{d\vartheta}$ But note this is simply the derivative of $\zeta_6^{-1}\log(z)$. Thus, since the derivative of $\zeta_6^{-1}\log(z)$ with respect to $\vartheta$ is a purely real quantity, the derivative of its imaginary part with respect to $\vartheta$ vanishes, and we have that it must have constant imaginary part: $\text{Im}(\zeta_6^{-1}\log z) = C$ (here $C$ is a new arbitrary constant, not the risk constant from before). By decomposing $z$ into its real and imaginary parts in terms of $g$ and $\vartheta$ and computing the imaginary part of $\zeta_6^{-1}\log z$ in terms of these quantities (and simplifying a bit, absorbing some numbers into the arbitrary constant $C$), we obtain the following gross-looking definition for $g(x)$ as an implicit function of $x$: $\tfrac{1}{2}\log(g^2 - xg + x^2) - \tfrac{1}{\sqrt{3}}\arctan\Big(\frac{x-2g}{\sqrt{3} x}\Big) = C$

Nasty! But if you open up a graphing calculator and plot the curve defined by this equation:

$\tfrac{1}{2}\log((f-x)^2 - x(f-x) + x^2) - \tfrac{1}{\sqrt{3}}\arctan\Big(\frac{3x-2f}{\sqrt{3} x}\Big) = C$

you will miraculously find a shape that looks just like the graph we saw earlier.

As discussed before, the empirical results of my (sus) gradient descent had led me to believe that the true minimax solution is a truncated version of the decision function $f$ (for some value of the constant $C$) defined as follows, where $x_\ast$ is the smallest real value $x$ for which $f(x) = 1$, i.e. the point at which $f$ turns into a decision function that no longer makes sense: $\delta(x) = \begin{cases} f(x) & \text{if }x \leq x_\ast \\ 1 & \text{else}\end{cases}$ and further, that $\delta$ is such that its risk function $R$ satisfies $R(0) = R(1)$, where $R(0)$ is understood as the limit of $R(\vartheta)$ as $\vartheta\to 0$. It turns out that we can actually calculate the exact value of the constant $C$ (and hence the implicitly defined function $f$) for which this condition on the risk function is satisfied. Writing $f_0 = f(0)$, so that $R(0) = f_0^2$, this condition is equivalent to $f_0^2 = \int_0^1 \big(1-\delta(x)\big)^2 ~ dx$ or, since $\delta = 1$ for all $x\geq x_\ast$, the condition is $f_0^2 = \int_0^{x_\ast} \big(1-f(x)\big)^2 ~ dx$ Using the properties of $f$ and the differential equation for $g$, the RHS can actually be reduced to a pretty simple form in terms of $x_\ast$. I'll leave the details as an exercise, but it can be shown that the RHS can be reduced to the simple expression $f_0^2 - (1-x_\ast)(2x_\ast^2 - 3x_\ast + 1)$, from which it follows that the above equation implies $(1-x_\ast)(2x_\ast^2 - 3x_\ast + 1) = 0$ The only solution to this equation other than $x_\ast = 1$ (which is not possible) is $x_\ast = 1/2$ - a remarkably simple answer to a very complicated question! This means that if the condition $R(0)=R(1)$ holds then we must have $x_\ast = 1/2$, and consequently $f(1/2) = 1$, or $g(1/2) = 1/2$. 
Setting $(x,g) = (1/2,1/2)$ in the implicit equation defining $g(x)$ gives the following value of the constant $C$: $C = -\log 2 + \frac{\pi}{6\sqrt{3}}\approx -0.391$ which yields the following minimax loss value: $\sup_{\vartheta\in [0,1]} R(\vartheta) = f(0)^2 = \tfrac{1}{4}e^{-2\pi/(3\sqrt{3})}\approx 0.0746$ which agrees numerically with the minimax loss value that I found empirically.
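These constants are easy to sanity-check numerically. The Python sketch below is an illustration under stated assumptions rather than the computation behind the post: it takes $f(0)$ from the $x\to 0^+$ limit of the implicit equation (where the arctangent term tends to $-\pi/2$), integrates $g' = 1 - x/g$ with a hand-written classical Runge-Kutta scheme, and accumulates $\int_0^{1/2}(1-f)^2\,dx$ with the trapezoid rule. All names and step counts are choices made for the sketch.

```python
import math

SQRT3 = math.sqrt(3)

# Constant from substituting (x, g) = (1/2, 1/2) into the implicit equation.
C = -math.log(2) + math.pi / (6 * SQRT3)

# Limit x -> 0+ of the implicit equation: log g(0) + pi / (2 sqrt 3) = C, and f(0) = g(0).
f0 = math.exp(C - math.pi / (2 * SQRT3))
minimax_risk = f0 ** 2

def g_prime(x, g):
    return 1 - x / g

def integrate(steps=5000):
    """Integrate g' = 1 - x/g from x = 0 to x = 1/2 (classical RK4) and accumulate
    the integral of (1 - f(x))^2, where f = g + x, with the trapezoid rule."""
    h = 0.5 / steps
    x, g = 0.0, f0
    risk_at_one = 0.0
    for _ in range(steps):
        k1 = g_prime(x, g)
        k2 = g_prime(x + h / 2, g + h * k1 / 2)
        k3 = g_prime(x + h / 2, g + h * k2 / 2)
        k4 = g_prime(x + h, g + h * k3)
        g_next = g + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        left = (1 - (g + x)) ** 2
        right = (1 - (g_next + x + h)) ** 2
        risk_at_one += h * (left + right) / 2
        x, g = x + h, g_next
    return g, risk_at_one

g_half, risk_at_one = integrate()
print(C, minimax_risk, g_half, risk_at_one)
```

If everything is consistent, `g_half` should come out very close to $1/2$, and `risk_at_one` (the risk of the capped decision function at $\vartheta = 1$) should match `minimax_risk`.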

After a few weeks of banging my head against the wall, I wasn't expecting to find an exact value for the minimax expected loss at all. And if I had, I certainly wouldn't have expected that kooky number!

Proof of minimaxity

At this point, I had a numerical minimax solution, and I had an analytical description of a function agreeing with it numerically which I conjectured to be minimax. But still no proof of its minimaxity.

But finally I realized that another result from Lehmann on minimax solutions could still be applied, since it had slightly weaker hypotheses. In the weaker theorem, the risk function $R$ is not required to be constant. It is just required to attain its maximum value on a set of unit measure with respect to the prior distribution for which $\delta$ is Bayes. That is, $\delta=\delta_\Lambda$ needs to be a Bayes solution for some prior distribution $\Lambda$, and the risk function $R$ only needs to be constant and equal to its maximum on some subset $A\subset [0,1]$, so long as that subset has probability $\Lambda(A) = 1$.

The solution $\delta$ that we've constructed above has a risk function $R(\vartheta)$ that is not constant. However, it is constant, and equal to its maximum value, on the set $[0,1/2]\cup \{1\}$. This result means that if we can find a prior distribution $\Lambda$, supported on the set $[0,1/2]\cup \{1\}$, for which $\delta=\delta_\Lambda$ is the Bayes solution, then $\delta$ is in fact minimax. And this time, there are no glaring red flags about $\delta$ that preclude it from being minimax.

The most general prior distribution $\Lambda$ supported on $[0,1/2]\cup \{1\}$ takes the form $\Lambda(X) = p [[ 1\in X ]] + (1-p)\Lambda_0(X)$ where $p\in [0,1]$ and $\Lambda_0$ is a narrower distribution on the set $[0,1/2]$. That is, a random variable on the set $[0,1/2]\cup \{1\}$ must take the value $1$ with some probability $p\in [0,1]$, and must follow some other distribution $\Lambda_0$ when its value falls in the interval $[0,1/2]$.

We shall try to choose $p$ and $\Lambda_0$ in order to make our $\delta$ the Bayes solution for $\Lambda$. When the squared error loss function is used, the Bayes solution is just the expected value of the posterior distribution, meaning that $\delta_\Lambda(x) = \mathbb E[\vartheta | x] = \frac{\tfrac{p}{1-p} + \int_x^{1/2} \Lambda_0(\phi) ~ d\phi}{\tfrac{p}{1-p} + \int_x^{1/2} \tfrac{1}{\phi}\Lambda_0(\phi) ~ d\phi}$

when $x\in [0,1/2]$. When $x > 1/2$, of course we have that $\delta_\Lambda(x) = 1$, since an observation greater than $1/2$ means that the value of $\vartheta$ must be $1$ when it is known to belong to $[0,1/2]\cup \{1\}$. So we want to choose a value of $p$ and a density function $\Lambda_0$ such that the above equals $f(x)$ for all $x\in [0,1/2]$.

With a bit of algebra, this can be turned into a nice differential equation for $\Lambda_0$. I found the following solution: $\begin{align*}p &= \frac{1}{1 + h(\tfrac{1}{2}) - h(0)} \\ \Lambda_0(\vartheta) &= \frac{p}{1-p} ~ \frac{dh}{d\vartheta} \end{align*}$ where $h(\vartheta) =1-\exp\bigg(-\int_{1/2}^\vartheta \Big(\frac{1}{f(\phi)-\phi} - \frac{1}{f(\phi)}\Big)\cdot f'(\phi) ~ d\phi\bigg)$ (The integral starts at $1/2$ because the integrals in the formula for $\delta_\Lambda$ vanish at $x = 1/2$.) I'm not sure if this integral representation for $h$ in terms of $f$ can be simplified any further using the differential equation for $f$ - perhaps it can. But the above is sufficient to show existence of a Bayes prior $\Lambda$ for which $\delta = \delta_\Lambda$. One can check that the above does in fact define a valid density function for $\Lambda_0$ - namely, $\Lambda_0$ is nonnegative on $[0,1/2]$ and it has unit integral over that interval - and that it satisfies the requisite integral equation to ensure $\mathbb E_\Lambda[\vartheta|x] = f(x)$ for $x\in [0,1/2]$. Just to get an idea of what this distribution looks like, here's the density function $\Lambda_0$ on $[0,1/2]$:

and the value of $p$ is approximately $0.597$.

This gives us a final definitive answer: the function $\delta$ is indeed a minimax solution of our problem. Whew!

That wraps up my extended treatment of this problem. I really was not expecting to find exact expressions for the minimax solution and minimax expected loss, and I'm blown away by how strange-looking they are. I also wasn't expecting to encounter such tricky differential equations.

Overall, a damn cool problem!

Investigating an oscillating Beatty-like sequence

2025-06-20

Beatty sequences are a lesser-known topic in mathematics that I find particularly interesting. Recently I investigated the following oscillating sequence of sums, which encodes the distribution of even and odd numbers in the Beatty sequence of the golden ratio:

$s(n) = \sum_{k=1}^n (-1)^{\lfloor \phi k\rfloor}$

This sequence looks like this:

Because $\phi$ is irrational, the ups and downs of this sequence look a little like those of a random walk. A certain self-similarity also shows up in the graph. I have found very satisfying answers to the following questions about $s(n)$, and I would encourage readers to have a go at solving them too:

Are the values of $s(n)$ bounded?
If so, what is an upper bound for $|s(n)|$?
If not, at what rate do the maximum and minimum values of $s(n)$ grow?
How can the values of $s(n)$ be computed efficiently?

With the help of a few convenient theorems about continued fractions, one can prove a very useful recursive formula for $s(n)$ that makes answering these questions much easier.

For the convergents $p/q$ of the continued fraction of an irrational number $\alpha$, we have $\bigg|\alpha - \frac{p}{q}\bigg| \leq \frac{1}{q^2}$ and conversely, if a rational number $p/q$ is closer than $1/2q^2$ to $\alpha$, it must be a convergent of the continued fraction of $\alpha$. The convergents of the irrational number $\phi$ are exactly the ratios $F_{k+1}/F_k$ of consecutive Fibonacci numbers. So let $n,k\in\mathbb N$ be such that $n < F_{k-1}$ and let $N$ be the natural number nearest to $\phi n$. As long as $N/n$ is not a convergent of $\phi$, we have $|\phi n - N| = n\cdot \bigg|\phi - \frac{N}{n}\bigg|\geq \frac{1}{2n} > \frac{1}{2F_{k-1}}$ and on the other hand $|\phi F_k - F_{k+1}| = F_k\cdot \bigg|\phi - \frac{F_{k+1}}{F_k}\bigg|\leq \frac{1}{F_{k+1}} \leq \frac{1}{2F_{k-1}}$

since $F_{k+1} \geq 2F_{k-1}$ for every $k \geq 1$. (In the remaining case, where $N/n = F_{j+1}/F_j$ is a convergent, $n$ is a multiple of $F_j$ with $j < k-1$, and $|\phi n - N| \geq |\phi F_j - F_{j+1}| = \phi^{-j} > \phi^{-k} = |\phi F_k - F_{k+1}|$ directly.) That is, the quantity $|\phi F_k - F_{k+1}|$ is smaller than the distance between $\phi n$ and the nearest integer, so $\phi n$ and $\phi n +(\phi F_k - F_{k+1})$ must have the same floor: $\lfloor \phi n \rfloor = \lfloor \phi n + (\phi F_k - F_{k+1})\rfloor$ and therefore: $\lfloor \phi (n + F_k)\rfloor = \lfloor \phi n\rfloor + F_{k+1}$ In fact this formula also holds when $n=F_{k-1}$, not only when $n < F_{k-1}$. This identity yields a recursive formula that is very helpful when computing larger values of the function $s(n)$: $\begin{align*} s(F_{k}+n) &= \sum_{j=1}^{F_{k}+n} (-1)^{\lfloor \phi j\rfloor} \\ &= \sum_{j=1}^{F_k} (-1)^{\lfloor \phi j\rfloor} + \sum_{j=F_k+1}^{F_k+n} (-1)^{\lfloor \phi j\rfloor} \\ &= \sum_{j=1}^{F_k} (-1)^{\lfloor \phi j\rfloor} + \sum_{j=1}^{n} (-1)^{\lfloor \phi (j+F_k)\rfloor} \\ &= \sum_{j=1}^{F_k} (-1)^{\lfloor \phi j\rfloor} + \sum_{j=1}^{n} (-1)^{\lfloor \phi j\rfloor + F_{k+1}} \\ &= s(F_k) + (-1)^{F_{k+1}}s(n) \end{align*}$ The sequence of residues modulo $2$ of the Fibonacci numbers is periodic with period $3$, so that every third Fibonacci number is even. We can therefore simplify the formula, valid for $1\leq n\leq F_{k-1}$, as follows: $s(F_k + n) = s(F_k) + (-1)^{[\![ k \not\equiv 2 \bmod 3 ]\!]}s(n)$

With the help of this recurrence, it is quite easy to prove that the subsequence $s(F_k)$ is periodic as well. Clearly $s(F_{k+1})$ depends only on $s(F_k)$, $s(F_{k-1})$ and $k\bmod 3$, and by manual calculation one can easily confirm that $s(F_1) = s(F_7)$ and $s(F_2) = s(F_8)$, and conclude from this that $s(F_k) = s(F_{k+6})$ for all $k\in\mathbb N$:

$\begin{array}{l|l|l|l} n & F_n & (-1)^{F_n} & s(F_n) \\\hline 1 & 1 & -1 & -1 \\ 2 & 1 & -1 & -1 \\ 3 & 2 & 1 & -2 \\ 4 & 3 & -1 & -1 \\ 5 & 5 & -1 & 1 \\ 6 & 8 & 1 & 0 \\ 7 & 13 & -1 & -1 \\ 8 & 21 & -1 & -1 \\ 9 & 34 & 1 & -2 \\ 10 & 55 & -1 & -1 \\ 11 & 89 & -1 & 1 \\ 12 & 144 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ \end{array}$

This periodicity trivializes the computation of the term $s(F_k)$ in the recursive formula, which therefore essentially gives a relation between the values of $s(n)$ in the range $[F_k,F_{k+1}]$ and the values of $s(n)$ in the range $[1, F_{k-1}]$.

Zeckendorf's theorem states that every natural number can be decomposed into a sum of Fibonacci numbers in such a way that no two consecutive Fibonacci numbers occur in the sum. Natural numbers can be decomposed into their Zeckendorf representations very efficiently by algorithm. The recursive formula for $s(n)$ therefore makes it possible to compute exact values of $s(n)$ even when $n$ is so large that it would not be at all practical to compute $s(n)$ as a sum of $n$ terms. For example, I wrote a small Haskell program that uses our formula to compute the following values of $s(n)$:

$\begin{array}{l|l} n & s(n) \\\hline 10^{10} & 0 \\ 10^{50} & 2 \\ 10^{100} & -12 \\ 10^{500} & -4 \\ 10^{1000} & 30 \\ 10^{5000} & 10 \\ 10^{10000} & 72 \\ \end{array}$
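The recursive formula, together with the period-6 table of $s(F_k)$, translates into a short program. The following Python sketch is an independent illustration, not the Haskell program mentioned above; the function names and the exact-integer evaluation of $\lfloor \phi k\rfloor$ via `math.isqrt` are choices made for the sketch.

```python
from math import isqrt

def floor_phi_times(k):
    """Exact floor(phi * k) for an integer k >= 0, using phi = (1 + sqrt(5)) / 2."""
    return (k + isqrt(5 * k * k)) // 2

def s_slow(n):
    """Direct evaluation of s(n) as a sum of n terms."""
    return sum(-1 if floor_phi_times(k) % 2 else 1 for k in range(1, n + 1))

# Values of s(F_k) indexed by k mod 6, read off from the table of s(F_k).
S_FIB = {1: -1, 2: -1, 3: -2, 4: -1, 5: 1, 0: 0}

def s_fast(n):
    """Evaluate s(n) by peeling off the largest Fibonacci number F_k <= n and
    applying s(F_k + m) = s(F_k) + (-1)^[k != 2 mod 3] * s(m) repeatedly."""
    fibs = [0, 1, 1]  # fibs[k] = F_k
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    total, sign = 0, 1
    k = len(fibs) - 1
    while n > 0:
        while fibs[k] > n:
            k -= 1
        total += sign * S_FIB[k % 6]
        if k % 3 != 2:
            sign = -sign  # the remainder enters with a factor of -1
        n -= fibs[k]
    return total

print([s_fast(n) for n in range(1, 14)])
print([s_slow(n) for n in range(1, 14)])
```

The two printed lists should coincide. Since the loop runs once per term of the Zeckendorf representation, a call such as `s_fast(10**100)` only needs a few hundred big-integer operations.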

With the help of the recursive formula it is also fairly easy to prove that the following formulas hold for special values of the sequence $s(n)$:

$\begin{align*} & s(F_3 + F_6 + \cdots + F_{6n-3}) = -2n \\ & s(F_3 + F_6 + \cdots + F_{6n-3} + F_{6n-2}) = 2n-1 \\ & s(F_3 + F_6 + \cdots + F_{6n}) = 2n \\ & s(F_3 + F_6 + \cdots + F_{6n-3} + F_{6n+2}) = -(2n+1) \qquad (n\geq 1) \\ \end{align*}$

which confirms that $s(n)$ is bounded neither above nor below. (I believe that these input values for $s$ are exactly those where it attains its maximum and minimum values for the first time, but I have not yet managed to prove this.) We can conclude that the maximum values of $s(n)$ grow logarithmically, since the Fibonacci numbers grow exponentially and, by the recursive formula, $|s(n)|$ is at most twice the number of terms in the Zeckendorf representation of $n$: $\sup_{1\leq k\leq n} s(k) = \Theta(\log n)$

To finish, I would like to close this blog post with a few further questions. First: can you extend this technique to make it possible to compute sums such as $\sum_{k=1}^n \omega^{\lfloor \alpha k\rfloor}$, where $\omega$ is a complex root of unity and $\alpha\notin \mathbb Q$? Second: obviously $s(n)$ gives a divergent series as $n\to\infty$, but can you prove whether or not this series can be assigned a value by another summation method such as Cesàro summation? The following graph, which shows the Cesàro partial sums of $s(n)$ on a semi-logarithmic plot, strongly suggests that these sums diverge in the same way:

A technique for simplifying growth classes arising from recurrence relations

2025-06-01

Can you work out the asymptotic growth classes of the sequences defined by the following recurrence relations? $\begin{align*}T(n) &= T(n - \sqrt{n}) + \tfrac{1}{\sqrt{n}} \\ T(n) &= T(n - \sqrt{n}) + \sqrt{\tfrac{\log n}{n}} \\ T(n) &= T(n-\log^2 n) + \tfrac{1}{n} \\ T(n) &= T(n ~ / ~ 2) + \tfrac{1}{\log n} \\ T(n) &= T(n ~ / ~ 2) + e^{\sqrt{\log n}} \\ T(n) &= 2 ~ T(n ~ / ~ 2) + \tfrac{n\log\log n}{\log n}\end{align*}$

In a later draft of my bachelor's thesis, which mainly dealt with the fundamental properties of asymptotic growth classes and partial sums, I delved a little into recurrence relations. Unfortunately this section of the thesis was not developed thoroughly enough to be kept in the final draft. But it contained a great solution technique that deserves at least a blog post!

A key concept developed in my thesis is the moderateness of growth classes. By my definition, the growth class of a sequence $(a_n)$ is moderate if $(b_n)\in\Theta(n) ~ \implies ~ a_{b_n} = \Theta(a_n)$ i.e., if one reindexes $(a_n)$ by a sequence $(b_n)$ of linear growth, its growth class does not change. This property depends only on the growth class of $(a_n)$: if $a_n = \Theta(a'_n)$ and $(a_n')$ is moderate, then $(a_n)$ must be moderate too. Many common growth classes have this property (e.g. polynomial growth, logarithmic growth, all of their sums, products and roots, etc.), and it has many other convenient properties as consequences.

Among them is a simple but very important property that will prove very useful to us for solving recurrence relations. It says: if $(a_n)$ is moderate and $b_n = \mathcal O(n)$, then one can conclude that $\sum_{k=n}^{n + b_n} a_k = \Theta(a_n b_n)$ This conclusion is quite easy to prove and is not so groundbreaking by itself, but we will see shortly how it trivializes some recurrence relations.

Consider the following recurrence relation: $T(n) = T(n - a_n) + b_n$ where $T(0) > 0$ and $(a_n)$ is a sequence of natural numbers such that $1\leq a_n \leq n$ for all $n\in\mathbb N$ (so that $T(n-a_n)$ is well-defined). Under certain conditions one can prove that the growth class of $T$ depends only on the respective growth classes of $(a_n)$ and $(b_n)$. A consequence of this independence is that one can freely replace $(a_n)$ or $(b_n)$ by more convenient sequences drawn from the same growth classes, in order to make the growth class of $T$ easier to compute. And if $(a_n)$ and $(b_n)$ are moderate, then (so says the property of moderate growth classes that we pointed out above) we have: $b_n = \Theta\bigg(\sum_{k=n-a_n+1}^n \frac{b_k}{a_k}\bigg)$ so that the function $T^\ast$ defined by the following recurrence relation has the same growth class as $T$: $T^\ast(n) = T^\ast (n - a_n) + \sum_{k=n-a_n + 1}^n \frac{b_k}{a_k}$ But it is quite easy to prove that the following definition of $T^\ast$ satisfies this recurrence relation: $T^\ast(n) = \sum_{k=1}^n \frac{b_k}{a_k}$ and therefore: $T(n) = \Theta\bigg(\sum_{k=1}^n \frac{b_k}{a_k}\bigg)$ This technique reduces the computation of the growth class of $T$ to the computation of the growth class of this partial sum, and in my thesis I developed, in considerable detail, many techniques for computing the growth class of partial sums. In short, we know that as long as $(a_n)$ and $(b_n)$ satisfy certain fairly simple conditions:

$T(n) = T(n - a_n) + b_n ~ \implies ~ T(n) = \Theta\bigg(\sum_{k=1}^{n} \frac{b_k}{a_k}\bigg)$

For example:

$\begin{align*}T(n) = T(n - \sqrt{n}) + \frac{1}{\sqrt{n}} ~ &\implies ~ T(n) = \Theta(\log n) \\ T(n) = T(n - \sqrt{n}) + \sqrt{\frac{\log n}{n}} ~ &\implies ~ T(n) = \Theta\big((\log n)^{3/2}\big) \\ T(n) = T(n-\log n) + \frac{1}{n} ~ &\implies ~ T(n) = \Theta(\log\log n) \\ T(n) = T(n-\log^2 n) + \frac{1}{n} ~ &\implies ~ T(n) = \Theta(1)\end{align*}$
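As a rough numerical sanity check (not a proof), the sketch below iterates the first recurrence with $a_n = \lfloor\sqrt n\rfloor$ and $b_n = 1/\sqrt n$ and compares the result with the partial sum $\sum_{k=1}^n b_k/a_k$. The function names, the base case $T(m) = 1$ for $m\leq 1$, and the rounding of $\sqrt n$ to an integer are my own assumptions.

```python
import math

def solve_recurrence(n, a, b, base=1.0):
    """Evaluate T(n) = T(n - a(n)) + b(n) iteratively, with T(m) = base for m <= 1."""
    total = base
    while n > 1:
        total += b(n)
        n -= max(1, a(n))  # guard: always step down by at least 1
    return total

def partial_sum(n, a, b):
    """Compute sum_{k=1}^{n} b(k) / a(k), the predicted growth class of T."""
    return sum(b(k) / a(k) for k in range(1, n + 1))

a = lambda n: math.isqrt(n)        # a_n = floor(sqrt(n)), an assumed integer version
b = lambda n: 1.0 / math.sqrt(n)   # b_n = 1 / sqrt(n)

# If T(n) is Theta of the partial sum, these ratios should stay bounded
# away from 0 and infinity as n grows.
ratios = [solve_recurrence(n, a, b) / partial_sum(n, a, b)
          for n in (10**3, 10**4, 10**5)]
```

Swapping in other pairs `a`, `b` from the list above gives the same kind of check, although convergence of the ratio can be slow for the doubly logarithmic examples.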

The same technique can be applied to a similar class of recurrences: the so-called divide-and-conquer recurrences, which appear very often in the study of theoretical algorithms and take the following form:

$T(n) = \alpha ~ T(n ~ / ~\beta) + a_n$

where $\alpha > 0$ and $\beta > 1$. Before applying that technique, one must perform a substitution: if one defines $c = \log_\beta\alpha$, then it follows that

$n^{-c} T(n) = (n/\beta)^{-c} ~ T(n ~ / ~\beta) + n^{-c} a_n$

and one can therefore prove (here we hide many tedious details that a valid proof would have to work through) that $T(n) = \Theta(n^c T^\ast(n))$, where $T^\ast$ is defined by the following recurrence: $T^\ast(n) = T^\ast(n/\beta) + \frac{a_n}{n^c}$ This recurrence is amenable to our earlier technique. If $(a_n)$ is moderate, then $T^\ast$ would belong to the same growth class if we had defined it like this: $T^\ast(n) = T^\ast(n/\beta) + \sum_{k=n/\beta + 1}^n \frac{a_k}{k^{c+1}}$ As before, this recurrence collapses easily into a closed-form expression: $T^\ast(n) = \sum_{k=1}^n \frac{a_k}{k^{c+1}}$ so that $T(n) = \Theta\bigg(n^c \sum_{k=1}^n \frac{a_k}{k^{c+1}}\bigg)$ So, all in all, we know (again subject to certain properties of $(a_n)$) that

$T(n) = \alpha ~ T(n ~ / ~\beta) + a_n ~ \implies ~ T(n) = \Theta\bigg(n^c \sum_{k=1}^n \frac{a_k}{k^{c+1}}\bigg)$

This technique can also be applied in many cases where the so-called Master Theorem unfortunately does not apply because of restrictions on $(a_n), \alpha, \beta$, and it does so without relying on advanced results, e.g. from calculus. For example, here are a few asymptotic formulas arising from recurrences that cannot be solved with the Master Theorem:

$\begin{align*}T(n) = T(n ~ / ~ 2) + \frac{1}{\log n} ~ &\implies ~ T(n) = \Theta(\log \log n) \\ T(n) = T(n ~ / ~ 2) + e^{\sqrt{\log n}} ~ &\implies ~ T(n) = \Theta\big(\sqrt{\log n}\cdot e^{\sqrt{\log n}}\big) \\ T(n) = 2 ~ T(n ~ / ~ 2) + \frac{n\log\log n}{\log n} ~ &\implies ~ T(n) = \Theta(n \log^2\log n)\end{align*}$
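The divide-and-conquer formula can be spot-checked numerically in the same spirit. The sketch below evaluates the recurrence $T(n) = 2 ~ T(n/2) + \frac{n\log\log n}{\log n}$ on powers of two and compares it with $n^c\sum_k a_k/k^{c+1}$. The helper names, the integer division `n // beta`, and the cutoff at $n < 3$ (chosen so that $\log\log n$ is defined) are assumptions of mine, not part of the derivation.

```python
import math

def divide_and_conquer(n, alpha, beta, a, base=1.0, cutoff=3):
    """Evaluate T(n) = alpha * T(n // beta) + a(n), with T(m) = base for m < cutoff."""
    if n < cutoff:
        return base
    return alpha * divide_and_conquer(n // beta, alpha, beta, a, base, cutoff) + a(n)

def predicted(n, alpha, beta, a, cutoff=3):
    """Evaluate n^c * sum_{k=cutoff}^{n} a(k) / k^(c+1), where c = log_beta(alpha)."""
    c = math.log(alpha) / math.log(beta)
    return n**c * sum(a(k) / k ** (c + 1) for k in range(cutoff, n + 1))

a = lambda n: n * math.log(math.log(n)) / math.log(n)

# Bounded ratios are consistent with T(n) = Theta(n^c * sum a_k / k^(c+1)).
ratios = [divide_and_conquer(2**m, 2, 2, a) / predicted(2**m, 2, 2, a)
          for m in (10, 14, 18)]
```

The ratios need not approach 1, since the claim is only about growth classes; what matters is that they stay within a fixed range.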

go to homepage

The posts on this website are licensed under CC-by-NC 4.0.

Bounded variation and the translation operator

2025-05-20

Recently I have had to process some physical time-series data and investigate ways of measuring the accuracy of a model that produces such series. A typical measure is the norm $\lVert \cdot\rVert_1$, which quantifies the "distance" between real functions $f,g$ as follows: $\lVert f-g\rVert_1 := \int_\mathcal{D} |f-g| ~ dt$ It so happens that the physical model I am investigating sometimes produces data that is slightly delayed in time. This has prompted me to think a bit more theoretically (although it probably doesn't matter much for that project) about how much error can arise from small translations. That is, if $T_\Delta$ denotes the translation operator: $T_\Delta f(x) := f(x + \Delta)$ then what can be said about the distance $\lVert T_\epsilon f - f\rVert_1$ when $\epsilon$ is a small positive real number?

It takes little to convince oneself that for most non-pathological functions, continuous functions and step functions alike, the norm $\lVert T_\epsilon f - f\rVert_1$ decreases linearly in $\epsilon$ as $\epsilon\to 0$. But precisely which conditions must $f$ satisfy for this to be true?

I have found a very neat sufficient condition on $f$ for $\lVert T_\epsilon f - f\rVert_1 = \mathcal O(\epsilon)$ as $\epsilon\to 0$: bounded variation. Here I sketch a proof and also present an example of a function without bounded variation for which $\lVert T_\epsilon f - f\rVert_1$ decreases more slowly.

First we show that any integrable function $f:[0,1]\to \mathbb R$ of bounded variation satisfies $\lVert T_\epsilon f - f\rVert_1 = \mathcal O(\epsilon)$. To clarify: on this domain $[0,1]$ we define the operator $T_\Delta$ with wrap-around, as if the domain were a torus, that is, we define $T_\Delta f(x)$ as $f(x + \Delta \bmod 1)$. So suppose that $f:[0,1]\to \mathbb R$ is a function of bounded variation, and let $B > 0$ be a concrete upper bound on its variation.

Let $\epsilon \in (0, 1)$ be arbitrary and let $2n$ be the largest even integer for which $2n\epsilon < 1$. Since $f$ has bounded variation by hypothesis, it must also be bounded, and the leftover interval $[2n\epsilon, 1]$ has length at most $2\epsilon$, so that

$\int_0^1 |T_\epsilon f - f| ~ dx = \int_0^{2n\epsilon} |T_{\epsilon} f - f | ~ dx + \mathcal O(\epsilon)$

We now define a function $\Delta(I,\epsilon)$ whose arguments are an interval $I\subset [0,1]$ and a small positive real number $\epsilon > 0$:

$\Delta(I,\epsilon) := \sup_{x\in I} |f(x+\epsilon) - f(x)|$

As before, the expression $f(x+\epsilon)$ is understood to mean $f(x+\epsilon \bmod 1)$. Note that $\Delta(I,\epsilon)$ is always finite because $f$ is bounded. We can derive an upper bound on $\lVert T_\epsilon f - f \rVert_1$ by splitting the interval of integration into many pieces and bounding $| T_\epsilon f - f|$ on each subinterval in terms of $\Delta$:

$\int_0^{2n\epsilon} | T_\epsilon f - f | ~ dx = \sum_{k=0}^{2n-1}\int_{k\epsilon}^{(k+1)\epsilon} |T_\epsilon f - f| ~ dx \leq \epsilon\sum_{k=0}^{2n-1} \Delta(k\epsilon + [0,\epsilon], ~ \epsilon)$

Let $\delta \ll 1/n$ (for example, $\delta = 1/n^2$) and let $x_0,x_1,\cdots, x_{2n-1}$ be a sequence of values such that $x_k\in k\epsilon + [0,\epsilon]$ and $|f(x_k + \epsilon) - f(x_k)| \geq \Delta(k\epsilon + [0,\epsilon], ~ \epsilon) - \delta$; such values must exist because $\Delta$ is defined as a supremum. Splitting the sum that appears in our upper bound into its even and odd terms, we have:

$\sum_{k=0}^{2n-1} \Delta(k\epsilon + [0,\epsilon], \epsilon) = \sum_{k=0}^{n-1} \Delta(2k\epsilon + [0,\epsilon], \epsilon) + \sum_{k=0}^{n-1} \Delta((2k+1)\epsilon + [0,\epsilon], \epsilon)$

Considering first the sum over the even terms, we see that

$\sum_{k=0}^{n-1} \Delta(2k\epsilon + [0,\epsilon], \epsilon) \leq n\delta + \sum_{k=0}^{n-1} |f(x_{2k}+\epsilon)-f(x_{2k})|\leq n\delta + B$

by the definition of the bound $B$ on the variation of $f$, together with the fact that $x_0, x_0 + \epsilon, x_2, x_2 + \epsilon, \cdots$ is a nondecreasing sequence within the interval $[0,1]$. An upper bound for the odd terms can be derived similarly. Adding these bounds gives a bound on the whole sum:

$\sum_{k=0}^{2n-1} \Delta(k\epsilon + [0,\epsilon], \epsilon) \leq 2n\delta + 2B$

And since $\delta > 0$ could have been chosen arbitrarily small, we also have:

$\sum_{k=0}^{2n-1} \Delta(k\epsilon + [0,\epsilon], \epsilon) \leq 2B$

Thus we obtain a bound of $\mathcal O(\epsilon)$ on the integral under consideration:

$\int_0^1 | T_\epsilon f - f | ~ dx \leq 2B\epsilon + \mathcal O(\epsilon) = \mathcal O(\epsilon)$
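A quick numerical illustration of this $\mathcal O(\epsilon)$ behavior, as a sketch rather than a verification of the proof: the code below approximates $\lVert T_\epsilon f - f\rVert_1$ with a midpoint Riemann sum for a step function, using the wrap-around shift. The function `shift_distance`, the sample count, and the particular step function are assumptions chosen for illustration. For this $f$ the exact distance is $2\epsilon$ whenever $\epsilon < 0.3$, since the shifted and unshifted functions differ on two intervals of length $\epsilon$.

```python
def shift_distance(f, eps, samples=100_000):
    """Midpoint Riemann sum for the L1 distance between f and its wrap-around shift by eps."""
    total = 0.0
    for i in range(samples):
        x = (i + 0.5) / samples
        total += abs(f((x + eps) % 1.0) - f(x))
    return total / samples

# Hypothetical test function: a step with one jump at 0.3 and one at the wrap-around point.
step = lambda x: 1.0 if x < 0.3 else 0.0

# Dividing by eps should give roughly the same constant for every eps.
ratios = [shift_distance(step, eps) / eps for eps in (0.02, 0.01, 0.005)]
```

The ratios stay close to 2 as $\epsilon$ shrinks, which is the linear decay that the bound predicts.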

As for counterexamples, it is not very hard to find functions lacking bounded variation for which $\lVert T_\epsilon f - f\rVert_1$ is not $\mathcal O(\epsilon)$. Consider, for example, $f(x) = \log x$ or $f(x) = 1/\sqrt{x}$ for $x\in (0,1]$ together with $f(0) = 0$. Continuous counterexamples are a bit harder to find, but they can be obtained through a construction resembling the topologist's sine curve.

Define the function $f:[0,1]\to\mathbb R$ piecewise on a family of intervals $[0,1/2), [1/2,3/4), [3/4,7/8), \cdots$ of geometrically decreasing size, such that on each interval $f$ is a sinusoid whose period divides the length of its interval. To sinusoid number $n$ we can assign an integer $N_n$ that sets its number of oscillations and an amplitude $a_n$, so that for each $n\in\mathbb N$, $f(x) = a_n\sin\big(2^{n+1}N_n\pi x\big)$ for all $x\in [1-2^{-n}, 1-2^{-(n+1)})$. A function defined this way is continuous provided that $a_n\to 0$ as $n\to\infty$.

Given this definition, when $f$ is translated by $\epsilon = (2^{n+1} N_n)^{-1}$, that is, half a period of sinusoid number $n$, "destructive interference" occurs within the $n$-th interval, in such a way that the maxima of $y = f(x)$ and of $y = f(x + \epsilon)$ fall out of alignment.

One can show that the following lower bound holds for this fixed value of $\epsilon$: $\lVert T_\epsilon f - f\rVert_1 \geq \frac{4}{\pi} \cdot \frac{a_n}{2^n}\cdot \Big(1 - \frac{1}{N_n}\Big)$

If one sets, for example, $a_n = 1/n$ and $N_n = n^2 2^n$, then $\lVert T_\epsilon f - f\rVert_1 = \Omega(\sqrt{\epsilon})$ according to this bound!
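To spell out the arithmetic behind that last claim (a quick check using only the definitions above, with no attempt to optimize constants): $\epsilon = \frac{1}{2^{n+1}N_n} = \frac{1}{2n^2 4^n} ~ \implies ~ \sqrt{\epsilon} = \frac{1}{\sqrt 2 ~ n 2^n} ~ \implies ~ \frac{a_n}{2^n} = \frac{1}{n 2^n} = \sqrt{2\epsilon}$ Since $1 - 1/N_n \geq 1/2$ for $n\geq 1$, the lower bound is at least a constant multiple of $\sqrt{\epsilon}$ along this particular sequence of values of $\epsilon$.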

I still have some unanswered questions. Is it possible for $\lVert T_\epsilon f - f\rVert_1$ not even to decrease to $0$ as $\epsilon\to 0$ when $f$ is a continuous function? Even if it is not true for every integrable function $f$ that $\lVert T_\epsilon f - f\rVert_1\to 0$ as $\epsilon\to 0$, must $0$ at least be a limit point of this quantity as $\epsilon\to 0$?

Ideal scaling of spaced repetition

2025-02-20

I'm a huge fan of spaced repetition for language learning. This method helped me retain vocabulary at a surprising rate while first learning German, and lately I've been trying to use it to learn Russian as well. So far I'm really enjoying it... Russian is an interesting and beautiful language. Maybe someday I'll write an essay about interesting things in Russian grammar.

If you're unfamiliar, here's a rundown on how spaced repetition works:

You have a deck of flashcards, each of which has a prompt and one or more correct answers
At any given time, each card is either "due" or "not due"
When a card is first introduced, it is not due, and becomes due in $\Delta$ hours, where $\Delta$ is a parameter (in my Russian deck, $\Delta = 8\text{ hr}$)
When you answer a card correctly, the amount of time before it next becomes due is multiplied by a fixed parameter $\varphi_+ > 1$ (in my Russian deck, $\varphi_+ = 1.2$)
When you answer a card incorrectly, the amount of time before it next becomes due is multiplied by a fixed parameter $\varphi_- < 1$ (in my Russian deck, $\varphi_- = 0.5$)
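The rules above can be sketched in a few lines of code. This is a minimal illustration and not the implementation of any particular flashcard app: the `Card` class, the function names, and the choice to schedule the next due time as "review time plus new interval" are assumptions, while the default parameters are the ones quoted for my Russian deck.

```python
from dataclasses import dataclass

@dataclass
class Card:
    interval_hours: float  # current gap before the card next becomes due
    due_at: float          # absolute time (in hours) at which it becomes due

def new_card(now, delta=8.0):
    """A freshly introduced card is not due yet; it becomes due after delta hours."""
    return Card(interval_hours=delta, due_at=now + delta)

def review(card, now, correct, phi_plus=1.2, phi_minus=0.5):
    """Rescale the interval after an answer and schedule the next due time."""
    card.interval_hours *= phi_plus if correct else phi_minus
    card.due_at = now + card.interval_hours
    return card

card = new_card(now=0.0)
review(card, now=8.0, correct=True)     # interval grows from 8 to 9.6 hours
review(card, now=17.6, correct=False)   # interval shrinks from 9.6 to 4.8 hours
```

After these two reviews the card's interval is 4.8 hours and it is next due at hour 22.4.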

Due cards "pile up" while you're away, and you generally sit down once or twice a day, or maybe once every couple days, to clear them all out. A flashcard that you know well will have its interval repeatedly inflated by a factor $\varphi_+ > 1$, so that you encounter it less and less frequently, making more room for cards that more urgently need your attention. Meanwhile, as you miss a card multiple times, its interval decays by $\varphi_- < 1$ each time, increasing the frequency with which you are asked to review it.

Studying flashcards with SR is part of my daily routine, and it can be an overwhelming habit to keep up at times, especially when a bunch of cards become due all at once and need to be cleaned out. I find myself tweaking how often I add new cards, and how many I add in each batch, to avoid giving myself an avalanche of new cards.

This got me thinking about how the workload of an SR flashcard deck scales over time. This post outlines some heuristic calculations meant to shed light on how the workload of an SR deck scales under ideal circumstances.

Let's consider the workload of maintaining a spaced repetition deck under the best possible circumstances: the user answers each card the moment it becomes due, and always gets it correct. While this is (sadly) not at all how things go in my SR deck, it should give a reasonable "best-case" analysis of how quickly cards pile up. Aside from that, my own flashcards go through a preliminary studying phase before entering the SR loop, so that I already know them somewhat well before they ever become due. For this reason, I'm usually able to answer $80-90\%$ of my due cards correctly, and this "ideal behavior" assumption may not be as preposterous as it sounds.

From here onward, I'll denote $\varphi_+$, the "correct factor", by $\varphi$, since $\varphi_-$ doesn't affect anything when the user never gets any cards wrong.

Under the above conditions, the sequence of intervals between successive due times of each card will be $\Delta, \Delta\varphi, \Delta\varphi^2,$ and so on. Hence, the time between each card's zeroth appearance (its introduction to the deck) and its $n$th appearance is given by $\Delta + \Delta\varphi + \cdots + \Delta\varphi^{n-1} = \Delta\frac{\varphi^{n}-1}{\varphi - 1}$ Note that while a card's interval grows exponentially with respect to the number of appearances of a card, it does not grow exponentially with respect to time. The formula above can be used to calculate how many times a card will have become due by time $t$: $n = \bigg\lfloor \frac{1}{\log\varphi}\cdot\log\bigg(1 + \frac{\varphi - 1}{\Delta}\cdot t\bigg)\bigg\rfloor$
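To make the closed form concrete, here is a small sketch that lists the due times of an always-correct card directly and compares a brute-force count against the formula for $n$. The function names are mine, and I am assuming the card is introduced at time $0$ and first becomes due at time $\Delta$.

```python
import math

def due_times(delta, phi, count):
    """Times at which an always-correct card becomes due: delta, delta + delta*phi, ..."""
    times, t, interval = [], 0.0, delta
    for _ in range(count):
        t += interval
        times.append(t)
        interval *= phi
    return times

def times_due_by(t, delta, phi):
    """Closed form: number of times the card has become due by time t."""
    return math.floor(math.log(1 + (phi - 1) * t / delta) / math.log(phi))

delta, phi = 8.0, 1.2
times = due_times(delta, phi, 30)

# Compare at midpoints between consecutive due times, away from rounding edge cases.
checks = []
for i in range(1, len(times)):
    mid = (times[i - 1] + times[i]) / 2
    brute_force = sum(1 for s in times if s <= mid)
    checks.append(brute_force == times_due_by(mid, delta, phi))
```

With $\Delta = 8$ and $\varphi = 1.2$, thirty appearances already span several months, which illustrates how slowly the count grows with time.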

The above, as a function of $t$, directly measures how much work an ideal user must do to maintain that individual card on its own (by tallying how many times they must answer it as time progresses). The fact that this is a step function makes it a little inconvenient to work with. So instead, I'll work with a function $\ell(t)$, which I'll call the load function, that more smoothly measures "the rate at which a flashcard adds work for the user" in a deck. It is defined to be zero when $t < 0$ (a card produces no work before it is added to the deck) and defined as follows for $t\geq 0$:

$\ell(t) := \frac{1}{\log\varphi}\cdot\frac{1}{t+\frac{\Delta}{\varphi - 1}}$

This function is meant as a way of "smoothing out" the discrete arrival times of flashcards, and thereby also smoothing out the cumulative function tracking the number of times a card has been studied over time. It has the property that the integral $\int_a^b \ell(\tau) ~ d\tau$ is approximately the number of times that the card is studied between $a$ and $b$ time units after its introduction to the deck, where this approximation has error $\leq 1$ at any given time. In particular, look how closely it matches our formula for the number of times $n$ that the card becomes due before time $t$: $\int_0^t \ell(\tau) ~ d\tau = \frac{1}{\log\varphi}\cdot\log\bigg(1 + \frac{\varphi - 1}{\Delta}\cdot t\bigg)$

If $\mathcal D$ represents a deck consisting of many cards $c\in\mathcal D$ with respective starting times $t_c$, then we have that the function $\Lambda(t) := \sum_{c\in\mathcal D} \ell(t - t_c)$ performs a similar role for the deck as a whole, tracking the rate at which the deck "accumulates work to be done". That is, the integral $\int_a^b \Lambda(\tau) ~ d\tau$ approximates the total number of cards studied between time $a$ and time $b$, with error at most $|\mathcal D|$. (If $[a,b]$ is a small interval, this error is quite significant, so this only makes sense as an actual approximation for the number of cards studied when the interval is long.) We can think of this "cumulative load function" $\Lambda(t)$ as a metric for how "busy" the deck is at any given time, in terms of how quickly new cards are arriving on average. Just as $\ell(t)$ smooths out the discrete arrival of a single card into a differentiable power curve, we can think of $\Lambda(t)$ as smoothing out the arrivals of due cards in an entire deck into a (mostly) differentiable "flow of cards".

Notice what happens to the load when a number of cards have already been added to a deck and no more are added. The load $\Lambda(t)$ will clearly decay as time progresses, since it is a sum of decreasing functions in $t$. Furthermore, these individual load functions are each $\Theta(t^{-1})$, so we can say that $\Lambda(t) = \Theta(t^{-1})$ as well, provided that no new cards are added.

Now let's consider what happens when a perfect user adds cards to a deck in batches of $K$ cards at a regular tempo, with a delay of $T$ between card batches. This mimics my own usage of spaced repetition at times, when I have a semi-regular regimen of adding new cards to my deck. It might also give a good representation of deck usage by someone who is studying vocabulary for a language class, where they receive approximately the same amount of new vocabulary each day or week according to a certain curriculum.

Under these conditions, we have $\Lambda(t) = \sum_{n=0}^{\lfloor t ~/ ~T\rfloor} K\ell(t-nT) = \frac{K}{\log\varphi}\sum_{n=0}^{\lfloor t ~/ ~T\rfloor} \frac{1}{(t-nT) + \frac{\Delta}{\varphi - 1}}$ This is nearly a harmonic sum, and reversing the sum's indices makes it clear how $\Lambda$ behaves asymptotically as $t\to\infty$: $\Lambda(t) = \frac{K}{\log\varphi}\sum_{n=0}^{\lfloor t ~/ ~T\rfloor} \frac{1}{Tn + T\{t/T\} + \frac{\Delta}{\varphi - 1}} = \frac{K\log (t)}{T\log(\varphi)} + \mathcal O(1)$

In other words, if you continue adding flashcards to your deck at a regular pace, your workload over time will grow logarithmically.
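The logarithmic growth can be checked numerically with a short sketch. The batch size $K = 10$ and the period $T = 24$ hours below are hypothetical values picked for illustration; $\Delta$ and $\varphi$ are the values from my Russian deck, and the function names are mine.

```python
import math

def load(t, delta, phi):
    """Smoothed per-card load l(t); zero before the card is introduced."""
    if t < 0:
        return 0.0
    return 1.0 / (math.log(phi) * (t + delta / (phi - 1)))

def deck_load(t, K, T, delta, phi):
    """Cumulative load for batches of K cards added at times 0, T, 2T, ..."""
    batches = int(t // T)
    return sum(K * load(t - n * T, delta, phi) for n in range(batches + 1))

K, T, delta, phi = 10, 24.0, 8.0, 1.2

# If Lambda(t) = K log(t) / (T log(phi)) + O(1), these gaps should stay bounded.
gaps = []
for t in (1e3, 1e4, 1e5, 1e6):
    leading = K * math.log(t) / (T * math.log(phi))
    gaps.append(deck_load(t, K, T, delta, phi) - leading)
```

Across three orders of magnitude in $t$, the gap between $\Lambda(t)$ and the leading term barely moves, which is what the $\mathcal O(1)$ error term predicts.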

What if we don't want the flashcard load to grow unboundedly over time? I, for one, am often hesitant to add new cards to my deck when I start waking up in the morning to find 150-200+ due cards in my deck. As a result, I've noticed that in the long term, the frequency with which I add new cards actually decreases as I try to keep the load beneath a reasonable bound.

This suggests the theoretical question: to keep the card load bounded, at what rate must the influx of new cards decrease asymptotically?

Let's suppose that cards are introduced in batches of $K$, and that batch number $n$ is introduced at time $t= A(n) = n\alpha(n)$, where $\alpha(n)$ is some monotone increasing function that controls the growing interval between batches. The constant case of $\alpha(n) = T$ reflects the situation we've already considered, in which cards are introduced periodically with a delay of $T$ units of time. Anything greater than constant growth of $\alpha$ indicates longer and longer waits between batches.

The load $\Lambda$ has local peaks at the times when new cards are introduced, so bounding $\Lambda$ by an upper bound of $B > 0$ is equivalent to bounding its values at the times $t = A(n)$ for $n\in \mathbb N$. Thus, we need to figure out what growth rates of $\alpha$ will keep the following sum bounded:

$\Lambda(A(N)) = \sum_{n=0}^{N} K\ell\big(A(N)-A(n)\big) = \frac{K}{\log\varphi}\sum_{n=0}^{N} \frac{1}{A(N)-A(n) + \frac{\Delta}{\varphi - 1}}$

We can derive a decent upper bound for $\Lambda(A(N))$ by splitting the sum into two pieces:

$\sum_{n=0}^{N/2}\frac{1}{A(N)-A(n) + \frac{\Delta}{\varphi - 1}} + \sum_{n=N/2+1}^{N}\frac{1}{A(N)-A(n) + \frac{\Delta}{\varphi - 1}}$

For the former sum, we have

$\sum_{n=0}^{N/2} \frac{1}{A(N)-A(n) + \frac{\Delta}{\varphi - 1}} \leq \sum_{n=0}^{N/2} \frac{1}{A(N/2) + \frac{\Delta}{\varphi - 1}} \leq \frac{1}{\alpha(N/2)}$

And for the latter sum:

$\sum_{n=N/2+1}^{N} \frac{1}{A(N)-A(n) + \frac{\Delta}{\varphi - 1}} \leq \sum_{n=0}^{N/2} \frac{1}{n\alpha(N/2) + \frac{\Delta}{\varphi - 1}} \leq \frac{\varphi-1}{\Delta} + \frac{\log(N) + 1}{\alpha(N/2)}$

Hence, we have the following upper bound on the peak load:

$\Lambda(A(N)) \leq \frac{K}{\log\varphi}\bigg(\frac{\varphi-1}{\Delta} + \frac{\log(N) + 2}{\alpha(N/2)}\bigg)$

Thus, if $\alpha(n) = \Omega(\log n)$, the load is bounded above by a constant, making it $\mathcal O(1)$. Furthermore, when $\alpha(n)$ does exhibit logarithmic growth $\Theta(\log n)$, this bound shows that the load can be tempered by varying the hidden constant to make this upper bound grow or shrink.
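As a sanity check on this bound, the sketch below computes the peak load $\Lambda(A(N))$ directly for one logarithmically growing schedule and compares it with the upper bound. The schedule $\alpha(n) = 24\log(n+2)$ hours, the batch size $K = 10$, and the function names are hypothetical choices of mine; the shift by $2$ only serves to avoid $\log 0$.

```python
import math

def peak_load(N, alpha, K, delta, phi):
    """Load just after batch N is added, with batch n introduced at A(n) = n * alpha(n)."""
    A = [n * alpha(n) for n in range(N + 1)]
    offset = delta / (phi - 1)
    return K / math.log(phi) * sum(1.0 / (A[N] - A[n] + offset) for n in range(N + 1))

def upper_bound(N, alpha, K, delta, phi):
    """The bound derived above; N is assumed even so that N/2 is an integer."""
    return K / math.log(phi) * ((phi - 1) / delta + (math.log(N) + 2) / alpha(N // 2))

K, delta, phi = 10, 8.0, 1.2
alpha = lambda n: 24.0 * math.log(n + 2)  # hypothetical schedule with logarithmic growth

# Each row holds (N, actual peak load, upper bound).
rows = [(N, peak_load(N, alpha, K, delta, phi), upper_bound(N, alpha, K, delta, phi))
        for N in (100, 1000, 10000)]
```

For each $N$ the computed peak sits below the bound, and neither quantity grows noticeably as $N$ increases by a factor of 100.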

The take-away? That if you want your workload in an SR deck to remain bounded over time, you may need to wait longer and longer intervals between new additions to your deck, with these intervals growing logarithmically. (Luckily, logarithmic growth is very slow.) More concretely, if you want your workload to eventually plateau at a steady rate of at most $\approx B$ due cards per hour, then these intervals may need to grow at a rate of approximately $\alpha(N) \sim \frac{1}{\frac{B\log\varphi}{K} - \frac{\varphi - 1}{\Delta}}\cdot \log(N) ~ \text{hrs}$
