

Syntactic and semantic word segmentation and labelling of the text of a large corpus is one of the basic research
activities needed to produce a linguistic database for language modelling. In this paper, the author explains the difficulties encountered in carrying out this activity in the project "a feasibility study for Farsi language modelling".
Several linguistic criteria and one engineering criterion were used to handle these difficulties. Finally, based on an n-state Markov process (n = 0, 1, 2, 3), a software package was written to extract the conditional probability distributions of Farsi words for both the label-dependent and label-independent cases.
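
As an illustration only, the extraction of conditional probability distributions can be sketched as follows. This is not the software package described above: the function name `conditional_distributions`, the relative-frequency estimator, and the toy tokens are all assumptions, and n is read here as the number of preceding tokens that the next token is conditioned on. In the label-independent case a token is a bare word; in the label-dependent case a token is taken to be a (word, label) pair.

```python
from collections import Counter, defaultdict


def conditional_distributions(tokens, n):
    """Estimate P(token | previous n tokens) by relative frequency.

    Hypothetical helper: `tokens` is a sequence of hashable items
    (bare words, or (word, label) pairs), and n = 0 gives the
    unconditional distribution under the empty context ().
    """
    context_counts = defaultdict(Counter)
    for i in range(n, len(tokens)):
        context = tuple(tokens[i - n:i])  # the n tokens preceding position i
        context_counts[context][tokens[i]] += 1

    distributions = {}
    for context, counter in context_counts.items():
        total = sum(counter.values())
        distributions[context] = {tok: c / total for tok, c in counter.items()}
    return distributions


# Toy placeholder tokens (not Farsi data), label-independent case.
words = ["a", "b", "a", "c", "a", "b"]
for n in (0, 1, 2, 3):
    print(n, conditional_distributions(words, n))

# Label-dependent case: the same routine applied to (word, label) pairs.
# The labels "N" and "V" are placeholders, not the project's label set.
tagged = [("a", "N"), ("b", "V"), ("a", "N"), ("c", "V"), ("a", "N"), ("b", "V")]
print(conditional_distributions(tagged, 1))
```

Treating a labelled word as a single token lets one routine serve both cases; a real system would also have to decide how to handle contexts never seen in the corpus, which this sketch leaves out.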
