Lately I've been an active contributor to Tatoeba, a huge open-source collection of parallel sentences in many different world languages. Aside from being an amazing resource for the languages I'm studying, it's given me exposure to languages that I didn't even know existed before, and it's also a great dataset for NLP projects.
I've been mulling over the following question: given a bunch of example sentences translated into your own language, would it be possible to algorithmically deduce translations for individual lexemes/morphemes? When done manually, this task is fairly simple. For instance, I don't know any Hungarian, but if I were given the following Hungarian translations of English sentences:
| English | Hungarian | 
|---|---|
| Tom filled the bottle with drinking water. | Tom megtöltötte az üveget ivóvízzel. | 
| Tom drinks at least three liters of water every day. | Tom naponta legalább három liter vizet iszik. | 
| If it weren't for water, humans wouldn't survive. | Ha nem lenne víz, az emberek nem élnék túl. | 
| The water came up to our knees. | A víz térdig ért. | 
| I would like some water. | Kérek egy kis vizet. | 
...then after staring at these sentences for a while, I would be able to guess that the word for water in Hungarian is víz without any prior knowledge. This is because the only thing in common between the English sentences is the word water, whereas for the Hungarian sentences it seems to be víz or viz (sometimes with additional prefixes/suffixes). In fact, I would also be willing to guess that ivóvízzel means drinking water.
Really, all I did was look at these sentences, find some common subsequences of characters, and make a heuristic judgment about what the most likely translation of the word water would be. This process seems amenable to automation, so I gave it a try!
Most of the NLP tools and algorithms that I've learned about are for processing text at the word/morpheme level, and presuppose a tokenizer/lemmatizer/stemmer for the target language. This task, however, occurs at the character level and concerns how we discover lexemes/morphemes in the first place. I'm not familiar with many NLP techniques that work at such a low level, so this problem has been a very fun challenge!
Below, I explain my approach and discuss some of its current weaknesses.
But first, a little eye candy!
Given a list of sentences in a target language with a word/phrase/substring in common between their English translations, my code calculates a kind of "heatmap" on each sentence, assigning each position in the sentence a score between 0 and 1 quantifying its local similarity to other sentences. We can visualize the results for specific sentences by graphing the scores by character index. Here's an example in Hungarian, generated when searching for a translation of the word water:

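The plotting itself is simple; here's a minimal sketch of how such a graph can be drawn, with placeholder scores standing in for the real values computed by the method described later in this post:

```python
import matplotlib.pyplot as plt

# Placeholder scores in [0, 1]; the real values come from the character
# scoring procedure described later in this post.
sentence = "Tom megtöltötte az üveget ivóvízzel."
scores = [0.2] * len(sentence)

plt.figure(figsize=(12, 2))
plt.plot(range(len(sentence)), scores)
plt.xticks(range(len(sentence)), list(sentence))
plt.ylim(0, 1)
plt.ylabel("relevance score")
plt.show()
```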
I also have a utility for visualizing this "relevance score" by highlighting segments of sentences in the target language with varying levels of saturation. Here's what this looks like for the same example in Hungarian:

Candidate words can be obtained from sentences by extracting segments containing the highest scores. By defining a custom string distance metric on the extracted strings and performing hierarchical clustering, we can also obtain a heuristic grouping of the words into clusters comprising possible lexemes. These clustered words or word forms can then be visualized as a dendrogram. Here's a dendrogram output by my code for the same example in Hungarian:

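My notebook defines its own distance metric, but the general recipe can be sketched with SciPy's hierarchical clustering tools; the normalized edit distance below is just a stand-in for the custom metric:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def string_distance(a: str, b: str) -> float:
    """Stand-in metric: edit distance normalized by the longer length."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[m, n] / max(m, n, 1)

# A few extracted word forms from the Hungarian "water" example.
words = ["vizet", "vízzel", "vízben", "vizünk", "ivóvízzel"]
condensed = squareform(
    [[string_distance(a, b) for b in words] for a in words], checks=False)
dendrogram(linkage(condensed, method="average"), labels=words)
plt.show()
```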
Although most of my test runs have used parallel sentences from Tatoeba, data can be ingested from any TSV-formatted file of parallel sentences. I was also able to import some data in Kannada (a Dravidian language spoken in India) from the Anuvaad Parallel Corpus and test out my algorithm on it. Here's the resulting dendrogram when I asked it to infer possible translations for the word beautiful:

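On the ingestion side, a minimal loader might look like this (the two-column layout of English sentence followed by target-language sentence is an assumption; adjust to your file's columns):

```python
import csv

def load_parallel_sentences(path):
    """Load (english, target) sentence pairs from a TSV file.
    Assumes two columns: English sentence, then target sentence."""
    with open(path, encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]
```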
Jump to the end of the post for a huge table of examples showing lexeme guesses for a few very common words in several different languages. Though my code is still pretty rough around the edges, I'm very happy with how the results are coming out so far, and I wonder if it has the potential to be developed into something more sophisticated, like an unsupervised best-effort lemmatizer for languages lacking established lemmatization tools.
If you're interested, you can check out my code in a Jupyter notebook here on GitHub. I encourage you to play around with it! Parallel sentence data from several sample languages is included in the repo, so no additional downloads (aside from Python packages) should be necessary.
Now, here's a more in-the-weeds description of my approach.
The problem under consideration is as follows: given a bunch of sentences in language $L$ whose translations contain a certain word $w$ (or more generally, matching a certain regex), produce one or more "candidate morphemes" in the language $L$ that might serve as translations of $w$. I'm calling this problem "unsupervised" because I'm not using ground truth data (such as dictionaries in various languages) to train any sort of model to recognize words.
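Selecting the input sentences is then just a regex filter over the English side of the corpus. A minimal sketch, using the hypothetical loader from above:

```python
import re

def sentences_matching(pairs, pattern):
    """Return target-language sentences whose English translations
    match the given regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [target for english, target in pairs if regex.search(english)]

# e.g. all target-language sentences whose translations mention "water":
# S = sentences_matching(load_parallel_sentences("eng-hun.tsv"), r"\bwater\b")
```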
My first thought was to use an n-gram model to analyze common sequences of characters in example sentences. Given a set $S$ of sentences in language $L$ with translations containing the target word $w$, we could tabulate the frequencies of 2-grams or 3-grams among those sentences. Then the "hottest" substrings in each sentence could be identified as the ones containing more high-frequency n-grams on average.

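A minimal sketch of this first approach, using 2-grams:

```python
from collections import Counter

def ngram_counts(sentences, n=2):
    """Tabulate character n-gram frequencies across a set of sentences."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

def position_scores(sentence, counts, n=2):
    """Score each position by the frequency of the n-gram starting there;
    the "hottest" substrings are runs of high-scoring positions."""
    return [counts[sentence[i:i + n]] for i in range(len(sentence) - n + 1)]
```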
I implemented this approach and it worked shockingly well for many languages. However, there was a huge drawback for languages like Arabic and Hebrew that have non-contiguous word roots. For instance, in Hebrew, the word for to read has the 3-letter root קרא, and conjugating this verb sometimes involves inserting letters in between: he reads becomes קורא, inserting the letter ו. This is a big problem for the n-grams approach, because when scoring these words in a collection of sentences, the 2-grams קר and קו would compete with each other in the 2-gram frequency count, causing different conjugations of to read to detract from each other's scores.
My immediate next thought was to use a generalized version of n-grams called "skip-grams", in which both contiguous and non-contiguous letter combinations are tabulated, e.g. there might be a frequency category not only for the substring קר, but also an additional category for occurrences of ק and ר separated by one or fewer characters, or by two or fewer characters, and so on. The problem with this approach is that the number of possible categories grows very quickly, and it's not obvious what kind of scoring system should be used to take them all into account.
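To make the proliferation concrete, here is what tabulating skip-grams would look like, with each (first character, second character, separation) triple getting its own frequency category:

```python
from collections import Counter

def skipgram_counts(sentences, max_skip=3):
    """Count character pairs at each separation up to max_skip."""
    counts = Counter()
    for s in sentences:
        for i, c1 in enumerate(s):
            for j in range(1, max_skip + 1):
                if i + j < len(s):
                    counts[(c1, s[i + j], j)] += 1
    return counts

# With an alphabet C and lookahead l, this is on the order of
# |C|**2 * l distinct categories.
```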
The idea I'm about to describe occurred to me at around midnight one night, and I ended up staying up until about 3am frantically coding up a proof-of-concept - I had to know if it would work! Vector embeddings were fresh in my mind because of a recent online course in NLP, but I had never seen a vector embedding technique applied to individual characters rather than words.
Say $C$ is the set of all characters in the language $L$. These characters might be normalized to avoid distinguishing characters that are "really the same", e.g. capitalized versus lowercase versions of the same letter, or accented versus non-accented variants, etc. Each character $c\in C$ is assigned a unit vector $\phi(c)\in \mathbb R^d$ where $d$ is the dimension of the embedding. These embeddings should either assign orthogonal vectors to different characters (in which case we must have $d\ge |C|$) or very nearly orthogonal vectors to different characters (in which case we can often make do with fewer than $|C|$ dimensions). This ensures that different characters are handled independently.
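A minimal sketch of such an embedding, using random Gaussian unit vectors (which are very nearly orthogonal when $d$ is large); for exactly orthogonal embeddings, one-hot vectors with $d = |C|$ would do:

```python
import numpy as np

def character_embeddings(chars, d, seed=0):
    """Map each character to a unit vector in R^d. Random unit vectors
    are nearly orthogonal in high dimensions; use one-hot vectors with
    d >= |C| if exact orthogonality is desired."""
    rng = np.random.default_rng(seed)
    emb = {}
    for c in chars:
        v = rng.normal(size=d)
        emb[c] = v / np.linalg.norm(v)
    return emb
```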
Once we have a character embedding, we define a way of embedding pairs of characters separated by a certain number of indices in a string. This can be defined by a function $\psi:C^2\times \{0,\dots,\ell\}\to\mathbb R^d$, where $\psi(c_1,c_2,j)$ is the embedding for $c_1$ followed by $c_2$ after $j$ characters, and $\ell$ is the "lookahead value" determining the maximum level of separation represented by the embedding. I've experimented with a few different options for this embedding, but the general idea is that $\psi(c_1,c_2, i)$ and $\psi(c_1,c_2, j)$ should be somewhat similar to each other, especially when $i,j$ are close, in order to allow embeddings of the same character $c_2$ in slightly different positions after $c_1$ to "constructively interfere" with each other. In this way, $\psi$ acts sort of like a "fuzzy" n-grams frequency table that avoids a huge proliferation of frequency categories by allowing some of them to blend into each other. Further, $\psi(c_1,c_2,i)$ and $\psi(c_1,c_3,j)$ should be orthogonal or near-orthogonal when $c_2\ne c_3$.

I've tried a couple of concrete formulas for this embedding, one of which involves a unitary matrix $U$ close to the identity and a constant $\alpha < 1$. Both of these work pretty well, but I suspect that the results can be improved by more intelligently designing the function $\psi$, and this is a detail I want to continue experimenting with.
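For concreteness, one simple choice satisfying these requirements might look like the following sketch, which just decays the contribution of $c_2$ geometrically in the separation $j$ (the decay constant is arbitrary, and this is only an illustration, not necessarily the best-performing option):

```python
import numpy as np

ALPHA = 0.7  # illustrative decay constant, alpha < 1

def psi(phi, c1, c2, j, alpha=ALPHA):
    """Illustrative pair embedding: weight phi(c2) by alpha**j so that
    occurrences of the same character at nearby separations overlap
    ("constructively interfere"), while different characters stay
    near-orthogonal. The dependence on c1 enters through how these
    embeddings are grouped by first character in Psi below."""
    return (alpha ** j) * phi[c2]
```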
Next, we define a function $\Psi$ such that $\Psi(i, s)$ gives an embedding combining the character pair embeddings $\psi(s[i], -, -)$ for several of the characters following $s[i]$, up to the character $s[i+\ell]$ at the lookahead threshold:

$$\Psi(i, s) = \sum_{j=1}^{\ell} \psi(s[i], s[i+j], j)$$

And then, given a whole collection of sentences $S$, we define a combined embedding $\overline{\Psi}(c, S)$ that, intuitively speaking, summarizes the "average context" of the character $c$ in all of the places it appears in all the sentences of $S$. It is defined as follows:

$$\overline{\Psi}(c, S) = \lambda(c, S)\cdot\frac{1}{|S_c|}\sum_{s\in S_c}\frac{1}{|P_c(s)|}\sum_{i\in P_c(s)}\Psi(i, s)$$

where $S_c$ is the set of sentences in $S$ containing $c$, $P_c(s) = \{i : s[i] = c\}$ is the set of positions where $c$ occurs in $s$, and $\lambda(c, S)$ is a scaling factor discussed below.
This is an average of all of the embeddings $\Psi(i, s)$ of the positions where the character $c$ appears across all sentences, averaged across different appearances of the character $c$ in each sentence $s$. Making this an average rather than a sum is vital, both because it prevents extremely long sentences from affecting these embeddings disproportionately, and because it prevents high-frequency characters from having much larger embeddings in general. The result is also multiplied by a scaling factor $\lambda(c, S)$ punishing characters that occur only in a small number of the sentences in $S$.
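In code, building on the sketches above (and omitting the rare-character scaling factor for brevity):

```python
import numpy as np

LOOKAHEAD = 4  # the lookahead value l

def Psi(i, s, phi, ell=LOOKAHEAD):
    """Context embedding of position i in sentence s: combine the pair
    embeddings psi(s[i], s[i+j], j) for j = 1..ell."""
    d = len(next(iter(phi.values())))
    total = np.zeros(d)
    for j in range(1, ell + 1):
        if i + j < len(s):
            total += psi(phi, s[i], s[i + j], j)
    return total

def Psi_bar(c, sentences, phi):
    """Average context of character c: average over its occurrences within
    each sentence, then across the sentences that contain it."""
    per_sentence = []
    for s in sentences:
        positions = [i for i, ch in enumerate(s) if ch == c]
        if positions:
            per_sentence.append(
                np.mean([Psi(i, s, phi) for i in positions], axis=0))
    return np.mean(per_sentence, axis=0) if per_sentence else None
```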
Finally, for each sentence $s$ in $S$, each of its characters is scored by calculating the cosine similarity of each character's local embedding in that specific sentence with its global embedding across all of the sentences in $S$. That is:

$$\text{score}(i, s) = \frac{\Psi(i, s)\cdot\overline{\Psi}(s[i], S)}{\lVert\Psi(i, s)\rVert\,\lVert\overline{\Psi}(s[i], S)\rVert}$$
When a character is followed by sequences of characters that frequently follow it in many of the sentences in $S$, then the vectors $\overline{\Psi}(c, S)$ and $\Psi(i, s)$ should point in similar directions, meaning that $\text{score}(i, s)$ should be larger.
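Or, in code (in practice you'd cache $\overline{\Psi}$ per character rather than recomputing it for every position):

```python
import numpy as np

def score(i, s, sentences, phi):
    """Cosine similarity between the local context embedding at position i
    of sentence s and the global average context of the character s[i]."""
    local = Psi(i, s, phi)
    global_avg = Psi_bar(s[i], sentences, phi)
    denom = np.linalg.norm(local) * np.linalg.norm(global_avg)
    return float(local @ global_avg) / denom if denom else 0.0
```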
In my scripts, I also apply some final post-processing to the character scores $\text{score}(i, s)$ for each sentence. For one, I scale and translate the scores into the interval $[0,1]$ by subtracting the minimum score and scaling by the difference between the min and max scores. I also smooth the scores across each sentence by taking a windowed average, and apply a power function such as $x\mapsto x^4$ because it accentuates the difference between higher and lower scores. This is how we get the "heatmap" highlighted sentences and graphs showcased earlier. Extracting the words occurring at the peaks of these graphs is how relevant words are extracted from sentences.
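A sketch of these post-processing steps:

```python
import numpy as np

def postprocess(scores, window=3, power=4):
    """Min-max normalize to [0, 1], smooth with a windowed average,
    then apply x -> x**power to accentuate the peaks."""
    x = np.asarray(scores, dtype=float)
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)
    x = np.convolve(x, np.ones(window) / window, mode="same")
    return x ** power
```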
This technique still has several kinks that need to be worked out. For instance, in its current form, it does not distinguish subsequences that are common within a certain subset of sentences from subsequences that are common throughout the language as a whole. For that reason, the results of the above process often contain some irrelevant high-frequency strings corresponding to common words, analogous to the, a/an, and I in English. The same goes for the names Tom and Mary, which are extremely common in the Tatoeba corpus (to the point of being an inside joke of the Tatoeba community). Perhaps character scores could be modified by penalizing characters whose local embeddings are too similar to their global embedding in the language as a whole.
On a similar note, even if a certain word is not common in the language as a whole, it may co-occur very commonly with the target word. Consider for instance the words read/reads/reading and book. Naturally, they co-occur in a lot of the English sentences of the Tatoeba corpus, so this technique might, say, misidentify the Hungarian word for book as an appropriate translation of to read. I still haven't made up my mind about how to remedy this issue.
Finally, there is a key type of deduction that we readily use when manually inferring word meanings, but that my vector method does not take advantage of. Let me illustrate it with another example. Consider the following parallel sentences in English and Latvian. From these sentences, can you guess a translation for the word milk?
| English | Latvian | 
|---|---|
| No, I never drink coffee with milk. | Nē, es nekad nedzeru kafiju ar pienu. | 
| Boris never confronted Rima. | Boriss nekad nestājās pretī Rimai. | 
| Don't drink alcohol. | Nedzeriet alkoholu. | 
| I didn't drink any coffee today. | Es šodien nedzēru kafiju. | 
| Do you actually like your coffee with salt? | Vai jums tiešām garšo kafija ar sāli? | 
| No, I can't. | Nē, es nevaru. | 
You could probably infer that pienu means milk even though it only appears in one of these sentences. This is because the remaining words in that sentence also appear in at least one of the other sentences, but milk does not appear in any of their English translations. That is, we have applied a process of elimination to deduce a translation for the word milk, which is a heuristic that my code does not (yet) attempt to use.
To sum up, the things I'd still like to improve are:
- designing the pair embedding function $\psi$ more intelligently;
- filtering out high-frequency strings that are common throughout the language as a whole (like the, a/an, Tom, and Mary);
- avoiding false positives from words that merely co-occur frequently with the target word;
- making process-of-elimination deductions like the one illustrated above.
Here's a big fat table showing my algorithm's output for a few common words in several different languages, in case you would like to get a feel for how well it works and the kinds of errors it makes. I recommend Wiktionary for looking up these words if you want to check the guesses for accuracy.
| Language | dog | cat | book | bread | water | milk | home | day | eat | sleep | read | black | white | big | small |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ber (Berber)  | aydinni uydinni aydi aydia uydi aydinneɣ aydinnek aydinnes aydiinu weydi  | amcicnni umcicnni amcica amcic umcic amuccnni amcicinu imucca yimucca imcac  | adlisnni udlisnni idlisen yidlisen adlis adlisa yedlisen adlisnnes adlisinu udlis  | aɣrum uɣrum weɣrum aɣrumnni uɣṛum aɣrumnnes weɣrumnni weɣruminu aqbur ara  | waman wamana aman watay yeḥman amanaya amannni amandin mani ameqqran  | akeffay akeffaya ukeffay akeffaynni ukeffaynni ukeffayis akeffaynnek ayefki uyefki yefkaiyid  | ɣer ɣef deg seg yedda yebda yella yelli taddart tamaneɣt  | wass ass assa wussan ussan ussana asmi assnni assnsen yessen  | isett nsett ttetteɣ ttetten setteɣ setten teččed teččeḍ iḥemmel ikemmel  | teṭṭes yeṭṭes neṭṭes yeṭṭsen yeḍḍes teṭṭseḍ teṭṭsed yettaṭṭas yelzem yiḍes  | yeqqar yeqqard yeɣra yeɣrad yeɣri adlis adlisa udlis idlisen yidlisen  | aberkan taberkant iberkanen tiberkanin krayellan tsednan aberqemmuc dakken asgainna ayisnnek  | amellal umellal tamellalt imellalen yimellalen tmellalt mellul tmellalin timellalin mellulet  | tameqqrant tameqrant ameqran ameqqran ameqṛan timeqqranin timeqranin imeqranen meqqren aḥeqqar  | amecṭuḥ tamecṭuḥt mecṭuḥit mecṭuḥet imecṭuḥen tameẓyant tamurt taḥanut teɣlust anect  | 
| ell (Greek)  | σκύλος σκύλους σκύλο σκύλου σκυλί σκυλιά δύσκολο του σου σκότωσε  | γάτα γάτας γάλα γάτες γάτος είναι είσαι φοβάται κοιμάται τα  | βιβλίο βιβλίου βιβλία τίτλος έβαλες βάλε το του ιστορικά ανήκει  | ψωμί ψωμιού σκορδόψωμο μέρα κάνω αυτοί τομ μισό έκοψε είναι  | νερό νερά νερού άερα πίνει πίνεις καλύτερο είναι έργα δεν  | γάλα για υγεία σόγιας λίγο αλλεργικός  | σπίτι στις πάτε πόδια είναι τεράστιου ποια παιδιά στο σπό  | μέρα ημέρα μέσα μέρες ημέρες μέχρι χώρα σήμερα μια μία  | τρώνε τρώει να ένα τρώω τομ τον φάω φάε τα  | κοιμάμαι κοιμάται κοιμάσαι κοιμήθηκα κοιμήθηκαν κοιμήθηκε κοιμήθηκες κοιμόταν κοιμούνται κοιμηθεί  | διαβάζει διαβάζεις διαβάσει διαβάσεις διαβάσω διαβάζω διάβασα διάβασμα διάβαζα διάβασε  | μαύρο μαύρος μαύρα μαύρη μαύρες τομ του τον το αγοριού  | άσπρο άσπρος άσπρα άσπρη άσπρους εκείνα είναι εμφανίζεται έναν ένας  | μεγάλο μεγάλος μεγάλοι μεγάλα μεγάλη μεγάλε μεγάλες μέγαλος μεγαλύτερη μεγαλουπόλεις  | μικρός μικρό μικρή μικρά μικρού είναι μεσαία μένα ένα ενός  | 
| hun (Hungarian)  | kutyát kutyád kutyám kutyák kutyákat kutyámat kutyáját kutyánkat kutyája kutyánk  | macskákat macskádat macska macskája macskát macskám macskád macskákért macskával macskánk  | könyvet könyveit könyvét könyvei könyved könyvek könyve könyveket könyvedet könyveim  | kenyeret kenyérhez kenyérre kenyérben kenyerünk bundáskenyeret kenyér kenyérből kent milyen  | vizet vizem vized vízen vízben vízzel vízhez vízre vízbe vizünk  | tejet tejed tejjel tehenet tejből tej teheneket vajat fejni sajt  | otthon itthon otthonom hazafele hazafelé otthonukról haza házat tom tomi  | nap napig napok napot napon napja napom napod napokra naponta  | eszem eszel eszik eszi szeretnél szeretnék esznek eszünk vettem ettem  | aludni elaludni aludnom alszik alszok aludj aludt aludjunk aludtunk aludtam  | olvastad olvastam olvassam olvasni elolvasni olvasod olvasok olvasom olvasol elolvastam  | fekete feketébe feketék feketében felhőket koromfekete szeretem nekem feketepiacról végezte  | fehér fehérre fehérbe falfehér fehérnél fehérbor elfehéredik megfehéredett festette fordult  | nagy vagy nagyon nagyok mary vagyok egy nagyvárosban nagyvárosok hogyan  | kicsi kocsim kicsiben kisvárosban kisvárosból kis cicije kisbicskát szókincsed kilátást  | 
| hye (Armenian)  | շունը։ շունը շունդ շունն շուն անունը շանը շանը։ շան։ ունի  | կատուն կատուն։ կատու։ կատուս կատու կատուները կատուները։ կատուներ կատուների կատվին։  | գիրքը։ գիրքը գրքեր գրքերը գրքերն գրքեր։ գիրքն գիրք գրել գրքում։  | հաց հացը հաց։ հացն հացը։ գնեցի։ գնեց։ գնելիս։ առավ։ պատվիրեցի։  | ջուր ջուրը ջուր։ ջուրը։ մաքուր նոր ունի ջրով խմում։ ու  | կաթը կաթ կաթի կաթ։ կաթը։ կատուն խմել։ խմել եմ են  | տուն տուն։ տանն տա՞նն տանը տանը։ տան շուտ տանել։ յաննին  | երեկ երեք ամեն մենք մերին երբեք այն տանն համար նրան։  | ուտում։ ուտու՞մ։ ուտում ուտո՞ւմ ուտու՞մ ուզում ուտել։ ուտես։ ուտելու ուտելու։  | քնում։ քնում քնել։ քնեք։ քնելը քնեց։ քնո՞ւմ քնել քնեցի քնեցի։  | կարդացել կարդացե՞լ կարդում կարդում։ կարդո՞ւմ կարդացել։ կարդալ կարդալ։ կարդա։ կարդաց  | սև սա ես այս են։ եք։ ամեն ամպերով։ մեքենան նա  | սպիտակ սպիտակ։ պատերը պատը տունը։ սա առյուծը է։  | մեծ մե՞ծ մեծ։ մենք ամեն է։ չէ։ են։ եմ աչքեր  | փոքր փոքրիկ բնակարանը բառարանը է։ էր։ որքա՞ն մեր երկիր էր  | 
| ind (Indonesian)  | anjing anjingku anjingmu anjingnya ingin jangan anaknya anggur siang makanan  | kucing kucingku kucingmu kucingnya bukan makan temukan ikan menyukai ini  | buku bukuku bukumu bukan bukunya suka baru bukubuku aku kesukaanmu  | roti rotinya dari tom itu turun wanita memberikan air mentega  | airnya air dari hari ada udara mandi mineral sendiri pantai  | susu susunya sudah sebelum nasi sapi dua setiap dari di  | rumah kerumah rumahmu rumahku rumahnya sebuah hujan bukan apakah pulang  | hari sehari harimu hasil harga nasi hampir seharian harihari kemarin  | makan akan makanan memakan dimakan maukah malam ikan mana kacang  | tidur tertidur tidurlah tidak ribut yaitu menidurkanku badak dua itu  | membaca membacakan dibaca beberapa baca dibacanya majalah sebuah bukunya padaku  | hitam kita wanita minum tanpa itu melihat pakaian tikus putih  | putih seputih batubatu hitam salju itu ini  | besar sebesar gambar sejajar semua seluas osaka sebuah sebelum terkadang  | kecil memiliki sempit lakilaki mencarikan terlalu kita tetapi tinggal ini  | 
| isl (Icelandic)  | hundinn hundurinn hundinum hundanna hundarnir hundur kötturinn hundasýningu eigandinn hundar  | kötturinn köttinn köttur hundurinn maðurinn kettir ketti kattar kettinum kött  | bókina bókin bókinni bók bóka bókarinnar bækurnar bækur tekur kemur  | brauð brauðbita borðarðu borðaði borða að með er  | vatn vatns vatni vatnið vatninu vatnsglas kranavatn vertu flöskunni fötunni  | mjólk mjólkar  | heima heim heiman heimilið eins heimabæinn til mig minnir er  | daginn dagurinn dagsins dag daga dagana enginn degi segir lengi  | borðar borða borðað borðaði borðum borðarðu orðin borðaðirðu brauð að  | sofa sofið svefni svefns sofandi sofnaði svefn svafst svaf hafa  | lesa lesið þessa lestu skáldsöguna skáldsögu elska lestur enska þessar  | svartir svartur svört svart svörtu svörtum kolsvart svartklædd var stór  | hvítar hvíta hvítt hvít hvítur hvítklædda hvað þetta hvítvínsglas eða  | stórt stór stóra stóri er ert eru stóran stórir en  | lítill lítil lítið litlum litlir litla hluti með lítið bill leit  | 
| kan (Kannada)  | ಪ್ರಾಣಿಗಳ ಪ್ರಾಣಿಗಳು ಇಲ್ಲಿವೆ ಇಲ್ಲಿಗೆ ಮಾತ್ರವಲ್ಲದೆ ಮಾತ್ರವಲ್ಲದೇ ಇಲ್ಲಿ ಇಲ್ಲಿನ ಪ್ರಾಣಿಗಳಾದ ಕತ್ತೆ  | ಚಿರತೆಗಳು ಚಿರತೆಗಳ ಕಾಡು ಕಂಡು ಬೆಕ್ಕು ಬೆಕ್ಕಿನ ಶ್ರೇಣಿಗಳನ್ನು ಪ್ರಾಣಿಗಳನ್ನು ಕಾಣಬಹುದು ಕಾಣಬಹದು  | ಪುಸ್ತಕಗಳು ಪುಸ್ತಕಗಳ ಪುಸ್ತಕಗಳನ್ನು ಪುಸ್ತಕವನ್ನು ಪ್ರವಾಸಿಗರು ಪ್ರವಾಸಿಗರಿಗೆ ಮತ್ತು ವಸ್ತು ಎತ್ತರ ಪುಸ್ತಕಗಳಿವೆ  | ಹಾಗು ಪ್ರಶಾಂತ  | ಮತ್ತು ಮುತ್ತು ಮತ್ತೆ ಹೊತ್ತು ಪ್ರವಾಸಿಗರ ಪ್ರವಾಸಿಗರು ಮತ್ತೊಂದು ಮರೆತು ಸುತ್ತಲು ನೀರಿನ  | ಮಾಡಿಸಲಾಗುತ್ತದೆ ಮಾಡಲಾಗುತ್ತದೆ ನೀಡಲಾಗುತ್ತದೆ ನಂಬಲಾಗುತ್ತದೆ ಪೂಜಿಸಲಾಗುತ್ತದೆ ಹಾಲನ್ನು ಹೆಸರನ್ನು ಮತ್ತು ಭಕ್ತರು ಮಾತ್ರ  | ಅಳಿವನಂಚಿನಲ್ಲಿರುವ ಅಳಿವಿನಂಚಿನಲ್ಲಿರುವ ಮತ್ತು ಮುತ್ತ ಪಕ್ಷಿಗಳಿವೆ ಪಕ್ಷಿಗಳಿಗೆ ಕತ್ತೆ ಪ್ರಾಣಿಗಳ ಪ್ರಾಣಿಗಳು ಮನೆಯಾಗಿದೆ  | ದಿನಗಳಲ್ಲೂ ದಿನಗಳಲ್ಲಿ ಬೆಳಗ್ಗೆ ಬೆಳಿಗ್ಗೆ ಆಚರಿಸಲಾಗುತ್ತದೆ ನೆರವೇರಿಸಲಾಗುತ್ತದೆ ತೆರೆದಿರುತ್ತಿದ್ದು ತೆರೆದಿರುತ್ತದೆ ಮತ್ತು ಮತ್ತೆ  | ಸೇವಿಸುತ್ತಾರೆ ಸಲ್ಲಿಸುತ್ತಾರೆ ಆಹಾರಗಳನ್ನು ಆಹಾರವನ್ನು ಪ್ರವಾಸಿಗರಿಗೆ ಪ್ರವಾಸಿಗರೂ ಕೊಲ್ಲುತ್ತಾರೆ ಮಾಡಬಹುದು ಮಾಡುವುದು ತಿನ್ನುತ್ತಾರೆ  | ಇಲ್ಲಿ ರಲ್ಲಿ ಇಲ್ಲಿಗೆ ಮಲಗಿರುವ ಮಲಗಿರುವಂತಹ ಒದಗಿಸುತ್ತದೆ ತೋರಿಸುತ್ತದೆ ಮಲಗುವ ಎಲ್ಲಾ ಎಲ್ಲರ  | ಸೂರ್ಯಾಸ್ತಮಾನವನ್ನು ಸೂರ್ಯಸ್ನಾನವನ್ನು ಇಲ್ಲಿನ ಇಲ್ಲಿಯ ಇಲ್ಲಿ ಸ್ವಾಗತವನ್ನು ಕ್ರಾಂತಿಯನ್ನೇ ನಲ್ಲಿ ಶಾಸನವೊಂದನ್ನು ಹೆಸರುಗಳನ್ನು  | ಕಪ್ಪು ಕೆಂಪು ಕಟ್ಟು ಕಪ್ಪುಕರಡಿ ಇಲ್ಲಿ ಇಲ್ಲಿನ ಮತ್ತು ಮತ್ತೊಂದು ರಫ್ತು ಬೆಕ್ಕು  | ಬಿಳಿ ಬಿಳಿಯ ನಿರ್ಮಿಸಲಾಗಿದೆ ನಿರ್ಮಿಸಲಾಗಿರುವ ಮತ್ತು ಮತ್ತೊಂದು ವಸ್ತು ನಿರ್ಮಿಸಲ್ಪಟ್ಟಿದೆ ಪ್ರವಾಸಿಗರು ಪ್ರವಾಸಿಗರನ್ನು  | ದೊಡ್ಡ ದೊಡ್ದ ಪ್ರವಾಸಿಗರ ಪ್ರವಾಸಿಗರು ಇಲ್ಲಿ ಇಲ್ಲಿನ ಇಲ್ಲಿದೆ ಇಲ್ಲಿಗೆ ನಲ್ಲಿ ದೊಡ್ಡದಾದ  | ಸ್ಥಳದಲ್ಲಿರುವ ಸಮೀಪದಲ್ಲಿರುವ ಇಲ್ಲಿವೆ ಇಲ್ಲಿಗೆ ಅಲೀಗಢದಲ್ಲಿರುವ ದೂರದಲ್ಲಿರುವ ಪ್ರವಾಸಿಗರು ಪ್ರವಾಸಿ ರಸ್ತೆಯಲ್ಲಿರುವ ಜಿಲ್ಲೆಯಲ್ಲಿರುವ  | 
| kat (Georgian)  | ძაღლია ძაღლი ძაღლის ძაღლს ძაღლები ძალიან ძაღლთან აი არ  | კატა კატები არის  | წიგნი წიგნის წიგნია წიგნში წიგნს წიგნებია წიგნები წიგნების წიგნმა ისინი  | პური პურს ვჭამ ჭამს მაქვს ვიყიდე  | წყალს წყალი წყლის  | რძე რძეს რძისგან სახლისკენ მე  | სახლში სახლშია სახლი სახლიდან სახლისკენ ახლა ლეილას დარჩით ისინი დაბნელებამდე  | დღე დღეს დღეა დღეში დღის ყოველდღე რამდენ მე ბარდება ეს  | ჭამს ჭამას ვჭამ ვჭამთ ჭამა გიჭამია გვიჭამია მიირთვა მიირთვი დესერტი  | მძინავს სძინავს გძინავს დაეძინა დაიძინა ძინავთ გვეძინა ეძინათ მეძინა დასაძინებლად  | კითხულობენ კითხულობს ვკითხულობ წაიკითხა წავიკითხავ კითხვა  | ძაღლი შავია  | არის თეთრი  | დიდი სახლი ის  | პატარა მდინარის მახლობლად სახლში ტომი  | 
| lit (Lithuanian)  | šunis šunys šuns šunį šunų šunims šuo šuniui nusipirkau nusipirkti  | katės katė katę kates katinai katinas katei kėdės katiną kam  | knygą knyga knygų knygas knygos knygoje mokiniai naudinga laikai yra  | duoną duona duonos duok parduoda kurią nori nuo pikto žinau  | vandens vandenį vandeniu vanduo vienas sunkesnis sūresnis daviau kareiviai negalėtume  | pieno pieną pienu pienas geria geriu neduoda išgerti nori palaukti  | namo namų namie namai mano esame neeiname neturite mane taip  | dienų dieną diena dienas viena dienos dienoms dienom dirba kasdien  | valgyti valgti pavalgyti valgėte valgei valgom valgo suvaglyti nevalgo nevalgė  | miegoti pamiegoti miegojai miegojo miega miego miegu miegi miegojau miegantį  | skaityti perskaityti skaitai skaityk perskaitysi perskatyti skaitoma neskaityk skaitau perskaitysiu  | juodas juoda juodai juodų juodo juodą juodus juodos juokiasi lova  | balta baltas baltą balto matau pabalo  | didelis didelias didelių dideli didelė didelį viena  | mažas mažos maža mažame matai namas maži mažą mažoje labai  | 
| lvs (Latvian)  | suni suns sunim suņi mans suņu mani manu suņiem sāka  | kaķis kaķus kaķim kaķi kaķa mazākais raibais tikai kaklu vairāk  | grāmata grāmatu grāmatas grāmatām smaga man tava ir tā  | maizi maizes rupjmaizi kvass esi kas ar ir  | ūdens ūdeni ūdenī ūdenim ūdenstilpē minerālūdens iedevu gruntsūdeņus putni nedzeru  | pienu piena piens rūgušpienu priekšroku reta nedzer pazīstama un ir  | mājās mājām atstājis atstāja nerunājam aizmirsa sekoja savu angliski ej  | dienu diena dienā vienu dienas dienai ēdiena dienās viena dienām  | ēstu ēst ēd ēdu mēs ēdīs ēdīsi ēdis neēdu ēdīšu  | gulēju gulēšu gulēja gulēji gulēt pagulēt gulēsi guļot guļ guļu  | lasīt lasītu lasot grāmatas grāmata lasīja lasīju izlasītu lasījis lasu  | melns melnas melnos melnās melna melnā melnu melnai melnie melnajā  | balts baltu baltā balto baltais bars melnas bikses tas straumi  | liela lielu lieli lielā liels lielās lielām saule saules tai  | mazas maza mazu mazs mazā mana manam maziem marija redzama  | 
| mkd (Macedonian)  | кучето кучево кучиња кучињата куче кучка куки кое очекуваше чуваше  | мачкава мачката мачките мачка мачки сака сакам кучиња имам таа  | книгата книгава книги книга книгите читаш дека магии премногу на  | леб леп лебот треба ли е со во од на  | водата водава вода додај додека воденица навадам доволна создаде да  | млеко млекото млеково мене козјо пиеме смееме колку ако може  | дома дом мама додека том има домашните одам одиме да  | ден дена еден денес денов денот дедо арен две дневно  | јадам јадат јадеш јаде јадел јадеме јадеше јади јадење јадено  | спијат спијам спие спиев спиеш спиел спиеле спиење спиеше спиј  | прочитам прочита прочиташ прочитал прочитав читам чита читаш читал читаше  | црни црно црна црн црниот том тоа црнец црната црнокос  | црнобели црнобело бели белци бела бело бел белата белиот врело  | голема големо големи голем том тоа поента помогна главното главната  | мала мали мало мал малата табла премала премали премало малечка  | 
| nob (Norwegian Bokmal)  | hunden hundene hunder under hund hun rundt nesten sovende hans  | katten kattene katter katt klatre hater etter svarte kanskje elsker  | bøker bøger bøkene boken bokas boka bok noen ønsker denne  | brød brødet dere denne drar ludder allerede de er egentlig  | vann vanne vannet mannen vanndamp enn var renne varer plantene  | melk melken melkekyr melkeallergi melkeproduksjon drikker i  | hjem hjemme jeg hele komme kommer hvis rette der deg  | dagen dager dag deg ganger klager dagboken leger lang gang  | spiser spise spiste spises spist spis pisa disse spisesalen pizza  | sovet sover sove sovende hver sov ideer søvn ligger som  | lese leser leste lest eller hele eventyr allerede denne disse  | svart svarte sort var hvit hatt har katten hesten en  | hvit hvite hvitt har var hest katten kanter svart vi  | stort stor store svart som etter sett et er en  | liten lite litt lille gutten den en enn kvinnen enden  | 
| ron (Romanian)  | câinele câinelui câine câini caine câinilor cine câinii inventat nevoie  | pisica pisică pisici pisicii pisicile pipăit scăpat trecea petrece mănâncă  | carte cartea aceasta această care cărți cărții cărui foarte cărțile  | pâinea pâine taie pe proaspătă în ai  | apă apa apei apus proaspăt piatră puțină luxoasă pe era  | laptele lapte poate alerga turnat ea el a  | acasă casă casa școală astăzi șase acum tatăl meargă rămas  | zi azi duminică duminica zile zilele săptămâna săptămânii ai fi  | mănânci mănânc mănânce mănâncă mănâncăți mânca mâncat mâncați mâncăm mâncare  | doarme doarmă dormi dormit dormind dormea dorm adorm dormeau dormeam  | citit citite citito citești citește citim citesc citească citi cartea  | negru negrul negre negri neagră afară mereu fiecare erau grup  | alb albă albe albi ale sau astăzi lebedele umple ca  | mare mari are marile țară tale foarte mărire favoare gaura  | mică mici mic este ești asta există acesta acest camera  |