Probability on Free Semigroups

Let \(I\) be a countable alphabet of letters, and let \(S\) denote the set of all finite length words using letters from \(I\). The empty word, with no letters, is denoted \(e\). Let \(\cdot\) denote the concatenation operation on elements of \(S\). That is, suppose that \(m, \, n \in \N\) and that \(x_i, \, y_j \in I\) for \(i \in \{1, 2, \ldots, m\}\) and \(j \in \{1, 2, \ldots, n\}\). Then \(x = x_1 \cdots x_m \in S\) and \(y = y_1 \cdots y_n \in S\), and \[x y = x_1 \cdots x_m y_1 \cdots y_n\] The space \((S, \cdot)\) is a discrete, positive semigroup, known as the free semigroup generated by \(I\).

The set of words \(S\) is sometimes denoted \(I^*\), and the elements of \(I\) are the irreducible elements of \((S, \cdot)\). The adjective free is used because there are no algebra rules imposed on \((S, \cdot)\). Here are two examples that we will use in exercises:

The binary semigroup is the free semigroup on the alphabet \(\{0, 1\}\), and so the simplest non-trivial free semigroup. The nonempty words are simply bit strings of finite length.

The genetic semigroup is the free semigroup on the alphabet \(\{A, C, G, T\}\). In genetics, the letters refer to the nucleotides that are the base building blocks of DNA: adenine, cytosine, guanine, and thymine, respectively. Nonempty words may represent genes.

For \(x \in S\) and \(i \in I\), let \(d_i(x)\) denote the number of times that letter \(i\) occurs in \(x\), and let \(d(x) = \sum_{i \in I} d_i(x)\) denote the length of \(x\).

Suppose again that \((S, \cdot)\) is the free semigroup corresponding to the countable alphabet \(I\).

Let \((S, \preceq)\) denote the partial order graph associated with \((S, \cdot)\). Then \(x \preceq y\) if and only if \(y = x t\) for some \(t \in S\), so that \(x\) is a prefix of \(y\) and then \(x^{-1}y\) is the corresponding suffix.
Let \((S, \upa)\) denote the cover graph of \((S, \preceq)\). Then \(x \upa y\) if and only if \(y = x i\) for some \(i \in I\). Hence \((S, \upa)\) is a regular tree rooted at \(e\) of degree \(\#(I)\).
For \(n \in \N\), let \(\upa^n\) denote the composition power of \(\upa\) of order \(n\). Then \(x \upa^n y\) if and only if \(y = x t\) for some \(t \in S\) with \(d(t) = n\).

Since \((S, \upa)\) is a rooted tree, all of the results in Section 1 apply. In particular, if \(\#(I) = k \in \N_+\), then \((S, \upa)\) is a regular rooted tree of degree \(k\). Specializing further, when \(k = 2\) and \(I = \{0, \,1\}\), so that we have the binary semigroup described in , then the corresponding cover graph \((S, \upa)\) is the binary tree. Note also that \(d(x)\) is the distance from \(e\) to \(x\) in the cover graph \((S, \upa)\) for \(x \in S\), so our notation is consistent. More generally, if \(x \preceq y\) then \(d(x, y) = d(y) - d(x)\) is the distance from \(x\) to \(y\) in \((S, \upa)\). Like all rooted trees, \((S, \preceq)\) is a lower semi-lattice and for \(x, \, y \in S\), \(x \wedge y\) is the largest common prefix of \(x\) and \(y\). But \((S, \preceq)\) is not an upper semi-lattice unless \(I\) consists of a single letter.

Let \(\N_I = \left\{(n_i: i \in I): n_i \in \N \text{ for } i \in I \text{ and } n_i = 0 \text{ for all but finitely many } i \in I \right\} \). We define the multinomial coefficients on \(\N_I\) by \[C(\bs{n}) = \#\left\{x \in S: d_i(x) = n_i \text{ for } i \in I\right\} = \frac{\left(\sum_{i \in I} n_i \right)!}{\prod_{i \in I} n_i!}, \quad \bs n = (n_i: i \in I) \in \N_I\]

The free semigroup has the property that\([e, x]\) is finite and totally ordered for each \(x \in S\). Moreover, it is the only discrete positive semigroup with this property.

Suppose that \((S, \cdot)\) is a discrete positive semigroup with associated partial order \(\preceq\) and with the property that \([e, x]\) is finite and totally ordered for each \(x \in S\). Then \((S, \cdot)\) is isomorphic to a free semigroup on an alphabet.

Details:

Let \(I\) denote the set of irreducible elements of \((S, \cdot)\). If \(x \in S - \{e\}\) then \(i_1 \preceq x\) for some \(i_1 \in I\) and hence \(x = i_1y\) for some \(y \in S\). If \(y \succ e\) then we can repeat the argument to write \(y = i_2z\) for some \(i_2 \in I\) and \(z \in S\). Note that \(x = i_1 i_2 z\) and hence \(i_1 i_2 \preceq x\). Moreover, \(i_1\) and \(i_1 i_2\) are distinct. Since \([e, x]\) is finite, the process must terminate. Thus, we can write \(x\) in the form \(x = i_1 i_2 \cdots i_n\) for some \(n \in \N_+\) and some \(i_1, i_2, \ldots, i_n \in I\). Finally, we show that the factorization is unique. Suppose that \(x = i_1 i_2 \cdots i_n = j_1 j_2 \cdots j_m\) where \(i_1, \ldots, i_n \in I\) and \(j_1, \ldots, j_m \in I\). Then \(i_1 \preceq x\) and \(j_1 \preceq x\). Since the elements of \([e, x]\) are totally ordered, we must have \(i_1 = j_1\). Using left cancellation we have \(i_2 \cdots i_n = j_2 \cdots j_m\). Continuing in this fashion it follows that \(m = n\) and \(i_1 = j_1\), \(i_2 = j_2, \ldots, i_n = j_n\).

Suppose that \((S, \cdot)\) is a discrete positive semigroup with the property that every \(x \in S\) has a unique finite factoring over the set of irreducible elements \(I\). Then \((S, \cdot)\) is isomporphic to the free semigroup on \(I\).

Details:

Every \(x \in S\) has a finite factoring over \(I\), so that \(x = i_1 i_2 \cdots i_n\). If this factoring is unique, then clearly \((i_1, i_2, \ldots i_n) \mapsto i_1 i_2 \cdots i_n\) is an isomorphism from the free semigroup \((I^*, \cdot)\) onto \((S, \cdot)\).

The left walk functions and the left generating functions for the graphs \((S, \preceq)\), \((S, \upa)\) and \((S, \Upa)\) are given in Section 1, as is the Möbius kernel for \((S, \preceq)\). Since we have a discrete positive semigroup, we can give the more basic Möbius function:

The Möbius function \(\mu\) of \((S, \cdot)\) is given by \(\mu(e) = 1\), \(\mu(i) = -1\) for \(i \in I\), and \(\mu(x) = 0\) for all other \(x \in S\).

\(\dim(S, \cdot) = \#(I)\).

Details:

Note first that a homomorphism \(\phi\) from \((S, \cdot)\) into \((\R, +)\) can uniquely be specified by defining \(\phi(i)\) for all \(i \in I\) and then defining \[\phi(i_1 i_2 \cdots i_n) = \phi(i_1) + \phi(i_2) + \cdots \phi(i_n)\] Thus, if \(\phi\) is a such a homomorphism and \(\phi(i) = 0\) for all \(i \in I\), then \(\phi(x) = 0\) for all \(x \in S\). Now suppose that \(B \subseteq S\) and \(\#(B) \lt \#(I)\). We will show that there exists a nonzero homomorphism from \((S, \cdot)\) into \((\R, +)\) with \(\phi(x) = 0\) for all \(x \in B\). Let \(I_B\) denote the set of letters contained in the words in \(B\). Suppose first that \(I_B\) if a proper subset of \(I\) (and note that this must be the case if \(I\) is infinite). Define a homomorphism \(\phi\) by \(\phi(i) = 0\) for \(i \in I_B\) and \(\phi(i) = 1\) for \(i \in I - I_B\). Then \(\phi(x) = 0\) for \(x \in B\), but \(\phi\) is not the zero homomorphism. Suppose next that \(I_B = I\). Thus \(I\) is finite, so let \(k = \#(B)\) and \(n = \#(I)\), with \(k \lt n\). Denote the words in \(B\) by \[i_{j 1} i_{j 2} \cdots i_{j m_j}, \quad j = 1, 2, \ldots, k\] The set of linear, homogeneous equations \[\phi(i_{j 1}) + \phi(i_{j 2}) + \cdots + \phi(i_{j m_j}) = 0, \quad j = 1, 2, \ldots, k\] has \(n\) unkowns, namely \(\phi(i)\) for \(i \in I\), but only \(k\) equations. Hence there exists a non-trivial solution. The homomorphism so constructed satisfies \(\phi(x) = 0\) for \(x \in B\).

We can use the strict positive semigroup to provide a simple example of a discrete semigroup in which multiples of counting measure \(\#\) are not the only left invariant measures.

Suppose that \(I\) has at least two elements and consider the strict positive semigroup \((S_+, \cdot)\) associated with the free semigroup \((S, \cdot)\). That is, \(S_+ = S - \{e\}\) while \(\cdot\) is still concatenation. Define the function \(g: S_+ \to (0, \infty)\) by \(g(x) = c_i\) if the terminal letter of \(x\) is \(i \in I\), where \(c_i \in (0, \infty)\) for each \(i \in I\) and \(c_i \ne c_j\) for some \(i, \, j \in I\). Let \(\lambda\) be the positive measure on \(S_+\) with density \(g\) (relative to \(\#\)). That is, \[\lambda(A) = \sum_{x \in A} g(x) = \sum_{i \in I} c_i n_i(A), \quad A \subseteq S_+\] where \(n_i(A)\) is the number of words in \(A\) with terminal letter \(i\) for each \(i \in I\). Then \(\lambda\) is left invariant: If \(x \in S\) and \(A \subseteq S\) then the words in \(x A\) have the same terminal letters as the corresponding words in \(A\), so \(\lambda(x A) = \lambda(A)\). But \(\lambda\) is clearly not a multiple of counting measure.