Sheeraz Memon, Sana Hoor Arisar, Imran Ali Jokhio
Abstract- This paper defines the task of automatic speaker recognition, briefly describes its possible applications, and summarizes the conventional methods of speaker recognition. A general framework of the speaker recognition methodology, comprising the training and testing stages, is presented. Conventional methods used at each stage of the speaker recognition process are summarized. These stages include pre-processing, feature extraction, speaker modeling, decision logic and testing. The concluding section includes a brief review of the speech corpora most frequently used in speaker recognition research.
Keywords- Speaker recognition, Feature extraction, Classification, Evaluation techniques, Speech corpora.
SPEAKER recognition can be defined as the task of establishing the identity of an individual from his or her voice. The ability to recognize the voices of those familiar to us is a vital part of oral communication between humans. Research has considered automatic speaker recognition since the early 1970s, taking advantage of advances in the related field of speech recognition. The speaker recognition task is often divided into two related applications: speaker identification and speaker verification. Speaker identification establishes the identity of an individual speaker out of a list of potential candidates. Speaker verification, on the other hand, accepts or rejects a claim of identity from a speaker. Speaker recognition may be categorized into closed-set and open-set recognition, depending on whether the recognition task allows for the possibility that the speaker being identified is not included in the list of potential candidates. Speaker recognition may be further categorized into text-independent and text-dependent recognition. If the text must be the same for the development of the speaker's template (enrollment) and for recognition (testing), this is called text-dependent recognition. In a text-dependent system, the text can either be common across all speakers (e.g., a common pass phrase) or unique to each speaker. Text-independent systems are most often used for speaker identification. In this case the text during enrollment and identification can be different.
Fig. 1 Major components of a speaker recognition system.
Fig. 2 Enrollment or training of a speaker recognition system (blocks include: speech samples from a speaker; parameter optimization procedure).
A conventional speaker recognition system, illustrated in Fig. 1, comprises two major stages: the enrollment or training process, and the recognition or testing process. During the enrollment stage, speech samples from known speakers are used to calculate vectors of parameters called the characteristic features [1,2]. The feature vectors are then used to generate stochastic models (or templates) for each speaker. Since the generation of model parameters is usually based on some kind of optimization procedure that iteratively derives the best values of the model parameters, the enrollment process is usually time-consuming. For that reason, the enrollment procedure is usually performed offline and repeated only if the models are no longer valid. Fig. 2 shows a typical functional diagram of the training process.
The recognition stage is conducted after training, that is, once the stochastic models for each class (speaker) have been built. During the recognition stage, the speaker recognition system is exposed to speech data not seen during training [1,2]. Speech samples from an unknown speaker or from a claimant are used to calculate feature vectors using the same methodology as in the enrollment process. These vectors are then passed to the classifier, which performs a pattern matching task to determine the closest-matching speaker model. This leads to a decision-making process which either determines the speaker identity (in speaker identification) or accepts or rejects the claimed identity (in speaker verification) [3-6]. The recognition stage is usually relatively fast and can be performed online under real-time conditions. Fig. 3 shows a typical block diagram of the recognition stage for speaker identification, whereas Fig. 4 shows the recognition stage for speaker verification.
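The two decision rules described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited system: the function names are hypothetical, and it assumes that each enrolled speaker model has already produced a match score for the test utterance and that a higher score means a closer match.

```python
def identify_speaker(scores):
    """Closed-set identification: pick the enrolled speaker whose model
    gives the highest match score (assumption: higher means closer)."""
    if not scores:
        raise ValueError("no enrolled speakers to compare against")
    return max(scores, key=scores.get)


def verify_speaker(claimed_score, threshold):
    """Verification: accept the identity claim only if the score of the
    claimed speaker's model reaches the decision threshold."""
    return claimed_score >= threshold
```

In identification the output is one of the enrolled identities; in verification the output is a binary accept/reject decision whose error trade-off is controlled by the threshold.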
Automatic speaker recognition is gaining acceptance in both government and financial sectors as a method to facilitate quick and secure authentication of individuals. For example, the Australian Government organization Centrelink already uses speaker verification to authenticate welfare recipients in telephone transactions [8]. Potential applications of speaker recognition include forensics [7], access security, phone banking, web services [9], personalization of services and customer relationship management (CRM) [10]. Biometric applications of speaker recognition provide very attractive alternatives to biometrics based on fingerprints, retina scans and face recognition. The advantages of speaker recognition over these techniques include the low cost and non-invasive character of speech acquisition, no need for expensive equipment, and the possibility of acquiring the data without the speaker's active participation or even awareness of the acquisition process. As an access security tool, speaker recognition can potentially eliminate the need to remember PINs and passwords for bank accounts, security locks and various online services [13]. The fundamental importance of speech as a biometric in commercial applications is probably most clearly expressed by a patent held by IBM for the use of speech biometrics in telephone applications, as well as by the ongoing intensive research in this area [11,12] carried out by IBM researchers.
Pre-processing and speech analysis
The pre-processing stage used in speaker recognition [4,14] can include speech processing for noise removal and enhancement; it can also include compensation for channel distortion, pre-emphasis filtering to remove the effects of lip radiation, and removal of silence and, in some cases, unvoiced speech intervals. Each of these approaches provides improvements to speaker verification performance over telephony channels. During the pre-processing stage, speech is usually divided into short-time frames using a windowing procedure, and the subsequent feature extraction is performed on a frame-by-frame basis. The short-time approach to feature extraction is justified by the fact that a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal.
Fig. 3 Testing stage for speaker identification.
Fig. 4 Testing stage for speaker verification.
Over a short time interval (e.g., 10-30 milliseconds), speech can be approximated as a stationary process [15-17]. Feature vectors extracted from speech on a frame-by-frame basis can therefore be used to generate stochastic models using approaches such as the Gaussian Mixture Model (GMM) or the Hidden Markov Model (HMM).
The choice of the analysis window length depends on whether the analysis aims to extract the speech source, vocal tract characteristics or long-term characteristics (e.g., word duration, intonation, speaking rate or accent) [18]. To obtain the information embedded in the vocal tract, speech is analyzed using segmental analysis with frames of length 10-30 ms. A frame of 10-30 ms captures a few pitch intervals, providing information about the vocal tract characteristics [19]. Segmental analysis is the most widely used method of feature extraction for speaker recognition [16,17,37]. To obtain the information embedded in the excitation source, sub-segmental analysis is used with speech frames of length 3-5 ms [20]. Sub-segmental analysis is designed to capture information within a single pitch period. Examples of sub-segmental speech analysis are described in [21,22,23]. For supra-segmental analysis, the speech is analyzed using frame lengths and overlaps in the range of 100-300 ms. This analysis method is appropriate for extracting information related to behavioral traits, including word duration, intonation, speaking rate and accent. Because this information varies comparatively slowly, large frames serve the purpose. Supra-segmental analysis of speech frames is used in [16,21,24,25], demonstrating that some behavioral traits can be captured with this type of analysis.
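The pre-emphasis and short-time framing steps discussed in this section can be sketched as follows. This is an illustrative sketch only: the pre-emphasis coefficient of 0.97, the 25 ms frame with a 10 ms hop, and the Hamming window are common choices assumed here, not values prescribed by the paper, and the function names are hypothetical.

```python
import numpy as np


def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is an assumed, commonly used value."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])


def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping frames and apply a Hamming window.
    A 25 ms frame falls in the 10-30 ms segmental range; passing
    frame_ms=4 or frame_ms=200 would give sub- or supra-segmental frames."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i of idx holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)
```

Each row of the returned array is one windowed frame, on which a feature vector can then be computed independently.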
Feature extraction methods
The process of converting a raw speech signal into a sequence of acoustic feature vectors carrying characteristic information about the speaker is called feature extraction. The majority of current feature extraction methods in speaker recognition use parameters derived from the classical source-filter model. The classical source-filter theory of voice production assumes that the air flow through the vocal folds (source) and the vocal tract (filter) is unidirectional. During voicing, the vocal folds vibrate. One vibration cycle includes the opening and closing phases, in which the vocal folds move apart or together, respectively. The number of cycles per second determines the frequency of the vibration, which is subjectively perceived as pitch or objectively measured as the fundamental frequency F0. The sound is then modulated by the vocal tract configuration and the resonant frequencies of the vocal tract, known as formants. Finally the speech signal is passed through the low-pass lip radiation filter, which reduces the signal energies with frequency by about 6 dB/octave [26]. Thus the uniqueness of the speaker-specific information may be attributed to several factors, such as the shape and size of the vocal tract, the dynamics of the articulators, the rate of vibration of the vocal folds, the accent of the speaker and the speaking rate. All these factors are reflected in the speech signal, and are therefore useful for speaker recognition.
Despite numerous studies examining the source and extent of variability in the speech signal [27], no conclusion has been reached from the linguistic, acoustic or forensic point of view as to what constitutes a “voice print”. As a result, a variety of parameters representing speaker-characteristic features have been proposed and successfully applied in speaker recognition. Speech features used in the classical applications of speaker recognition can be divided using different criteria. Based on the domain in which the analysis is conducted [28,29], the characteristic features can be divided into:
spectral features – representations of the short-term speech spectrum; the spectral features represent, wholly or partly, the physical characteristics of the vocal tract;
dynamic features – time variations of other features, such as spectral features;
prosodic features – refer to the fundamental frequency F0 and energy contours.
Based on the time duration of the analyzed speech segment, the prosodic features can be divided into the following categories:
source features – prosodic features within a single glottal period;
suprasegmental features – prosodic features spanning a few glottal periods;
high-level features – long-time features spanning the duration of a word or utterance.
Spectral features have been widely applied to the task of speaker recognition. The proposed spectral methods include Real Cepstral Coefficients (RCC) introduced in [33], Linear Prediction Coefficients (LPC) proposed in [30], Linear Predictive Cepstral Coefficients (LPCC) derived by Atal in [31], and Mel Frequency Cepstral Coefficients (MFCC) derived by Davis and Mermelstein in [32]. Spectral features such as the Linear Frequency Cepstral Coefficients (LFCC) [32] are similar to the MFCC; however, instead of the logarithmic Mel frequency scale, a linear scale is used, providing equally spaced filters on a linear rather than logarithmic scale covering the full signal bandwidth. Other types of spectral features include Perceptual Linear Prediction (PLP) coefficients [34] and the Adaptive Component Weighting (ACW) cepstral coefficients [35,36]. A study by Reynolds [17] compared different spectral features, namely MFCC, LFCC, LPCC and perceptual linear prediction cepstral coefficients (PLPCC), for speaker recognition. It was observed that the MFCCs and LPCCs gave significantly higher recognition rates. From a perceptual point of view, MFCCs bear a resemblance to the human auditory system, since these features account for the nonlinear nature of pitch perception. This is the primary reason for the superior performance of MFCC features. This success, combined with their robust and cost-effective computation, has established MFCC firmly in speech and speaker recognition applications. Recently a number of modifications of MFCC have been introduced and have shown better performance.
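To make the MFCC computation concrete, the sketch below derives cepstral coefficients for a single windowed frame: power spectrum, triangular filters equally spaced on the Mel scale, logarithm, and a discrete cosine transform. It is a simplified illustration under stated assumptions, not the exact procedure of [32]: the Mel mapping 2595 log10(1 + f/700), the 26 filters, the 13 coefficients and the 512-point FFT are commonly used values assumed here, and all function names are hypothetical. Replacing the Mel-spaced filter edges with linearly spaced ones would give an LFCC-style feature.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Assumed Mel mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with edges equally spaced on the Mel scale.
    Returns an array of shape (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        low, mid, high = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - low) / (mid - low)
        falling = (high - fft_freqs) / (high - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=13, n_fft=512):
    """MFCC vector of one windowed frame (zero-padded to n_fft samples)."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)  # small floor avoids log(0)
    # Unnormalized DCT-II of the log filterbank energies.
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2.0 * n_filters))
    return basis @ log_energies
```

Applying `mfcc_frame` to every frame of an utterance yields the sequence of spectral feature vectors that the modeling stage consumes.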
Features that represent time derivatives of the spectrum-based features are referred to as dynamic features. Dynamic cepstral features such as delta (first derivative of cepstral features) and double-delta (second derivative of cepstral features) have been shown to play an essential role in capturing the transitional characteristics of the speech signal [37]. A set of new dynamic features for speaker verification was introduced in [37]. These new features, known as Delta Cepstral Energy (DCE) and Delta-Delta Cepstral Energy (DDCE), can compactly represent the time-varying cepstral information. Dynamic features based on MFCC and LPCC were used in [39,41]. The Shifted Delta Cepstrum was used in [42] for speaker recognition and has shown promising results. A method using statistical dynamic features has also been proposed, in which a multivariate auto-regression (MAR) model is applied to the time series of cepstral vectors and used to characterize speakers [40]. The fusion of cepstral and delta-cepstral features has been shown to provide relatively good results for the task of speaker recognition [37,38]. It has been demonstrated that the performance of a speaker recognition system may be enhanced by adding time derivatives to the static parameters.
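A minimal sketch of how delta and double-delta features can be obtained from a sequence of static cepstral vectors is given below. It uses the widely used regression formula over a window of ±N frames, which is an assumption here rather than the specific method of the cited works; the function name, the default N = 2 and the edge padding are likewise illustrative choices.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based time derivative of a (frames x coefficients) array:
    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2).
    The first and last frames are repeated to pad the edges (assumption)."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_frames]
        behind = padded[width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom
```

Given static cepstra `c`, the augmented feature matrix `np.hstack([c, delta(c), delta(delta(c))])` appends the delta and double-delta coefficients to each frame, which is the usual way time derivatives are added to the static parameters.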
Prosodic speech features are often used to extract information about the speaking style of a person. The fundamental frequency, formants and the frame energy are the most commonly known prosodic features. These features are often supplemented with their logarithmically compressed values and added to the spectrum-based speech parameters in order to obtain better performance. The temporal derivatives of the fundamental frequency and the frame energy have also remained in use. A set of statistical parameters evaluated from the temporal parameters has also been shown to improve the performance of speaker recognition systems. The feature extraction methodology proposed in [43] introduces a number of improvements to the estimation of the fundamental frequency and accent. These improvements include the re-synthesis of the pitch contour, which removes the doubling/halving errors that occur during the calculation of the fundamental frequency. The drawbacks of the prosodic features include the fact that they can be easily mimicked or imitated. A combination of prosodic information with the spectrum-based features could lead to better performance and eliminate the possibility of the features being imitated.
In the past few years, increasing interest has been observed in the use of information fusion methods in speaker recognition [6,44-46]. Feature information fusion can take several forms, such as multi-feature fusion and multi-sample fusion [8]. In multi-sample fusion, a target speaker may be required to utter the same phrase a number of times, and the decision is then based on combining the scores [46]. In the multi-feature fusion approach, the same speech utterance is used to extract different features. As described above, MFCCs have proved to be the best-performing feature extraction method for speaker recognition. Dynamic features, or features extracted from prosodic information, can be helpful when fused with spectrum-based features such as MFCC, but cannot lead to a state-of-the-art design on their own. Much improved results are obtained using combinations (or fusions) of features. The linear prediction (LP) residual also contains speaker-specific source information [22], which can enhance the performance of speaker recognition systems. It has been reported in [48] that a combination of the LP residual with LPCC or MFCC improves the performance [47]. Plumpe et al. [27] developed a technique for estimating and modeling the glottal flow derivative waveform from speech for speaker recognition. In that study, the glottal flow estimate was modeled as coarse and fine glottal features, which were captured using different techniques. It was also shown that the combined coarse and fine structure parameters gave better performance than either parameter set alone [21,27].
Speaker modeling and classification techniques
The modeling techniques transform the voice features of a speaker into a representation that characterizes that speaker. The objective of a modeling technique is to generate speaker models using speaker-specific feature vectors. The modeling techniques can be classified as discriminative or generative, as shown in Fig. 5 [29]. The discriminative approaches include Linear Discriminant Analysis (LDA) [49], the Polynomial Classifier [50], Time-Delay Neural Networks (TDNN) [51], Recurrent Neural Networks (RNN) [52], the Multilayer Perceptron (MLP) [53], and Support Vector Machines (SVM) [54]. The major group of non-discriminative approaches is generative; it includes the Probabilistic Neural Network (PNN) [55], Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM).
Template matching techniques [31,56] were the most widely used techniques for speaker recognition in the early stages of this technology. In this approach, training and testing feature vectors are directly compared using a similarity measure, such as the spectral distance, the Euclidean distance or the Mahalanobis distance.
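The distance measures named above can be sketched as follows. The example is illustrative only: it assumes, for simplicity, that each speaker's template is a single long-term average feature vector and that the test utterance is summarized the same way, which is one possible form of template matching rather than the specific scheme of [31,56]. All function names are hypothetical.

```python
import numpy as np


def euclidean_distance(x, template):
    """Euclidean distance between a feature vector and a template."""
    diff = np.asarray(x, dtype=float) - np.asarray(template, dtype=float)
    return float(np.sqrt(diff @ diff))


def mahalanobis_distance(x, template, cov):
    """Mahalanobis distance: sqrt((x - m)^T C^{-1} (x - m)), where C is a
    covariance matrix of the features (assumed to be invertible)."""
    diff = np.asarray(x, dtype=float) - np.asarray(template, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)))


def closest_template(test_vectors, templates):
    """Return the speaker whose template is nearest (Euclidean) to the
    mean of the test feature vectors. `templates` maps names to vectors."""
    test_mean = np.asarray(test_vectors, dtype=float).mean(axis=0)
    return min(templates, key=lambda name: euclidean_distance(test_mean, templates[name]))
```

The Mahalanobis distance reduces to the Euclidean distance when the covariance matrix is the identity; otherwise it down-weights feature dimensions with large variance.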
Fig. 5 Major modeling approaches for speaker recognition.
The modern classifiers used in speaker recognition technology include Gaussian Mixture Models (GMM) [4], Hidden Markov Models (HMM) [57], Support Vector Machines (SVM) [59], Vector Quantization (VQ) [58], and Artificial Neural Networks (ANN) [60]. HMMs are mostly used for text-prompted speaker verification, whereas the GMM, SVM and VQ approaches are widely used for text-independent speaker recognition applications. The GMM is currently recognized as the state-of-the-art modeling and classification technique for speaker recognition [4]. The GMM models the probability density function (PDF) of a feature set as a weighted sum of multivariate Gaussian PDFs. It is equivalent to a single-state continuous HMM, and may also be interpreted as a form of soft VQ [61]. Support Vector Machines (SVM) have been used in speaker recognition applications over the past decade; however, the improvements in performance over the GMM were only marginal [59,62]. A combined classification approach including SVM and GMM was reported to provide significant improvement over GMM [63]. Various forms of Vector Quantization (VQ) methods have also been used as classification methods in speaker recognition [64,65]. The most common approach to the use of VQ for speaker recognition is to create a separate codebook for each speaker using the speaker's training data [65]. The speaker recognition rates based on VQ were found to be lower than those provided by the GMM [66]. As stated above, the GMM may be considered a “soft” form of VQ; building on that similarity, a combination of the VQ algorithm and a Gaussian interpretation of the VQ speaker model was described in [67]. In [68], Vector Quantization was combined with the GMM method, providing a significant reduction of the computational complexity compared with the GMM method alone.
Matsui et al. [64] compared the performance of VQ classification techniques with various HMM configurations. It was found that the continuous HMM outperformed the discrete HMM, and that VQ-based techniques were most effective in the case of minimal training data. Furthermore, the study found that the state transition information in HMM architectures was not important for text-independent speaker recognition. This study provided a strong case supporting the use of the GMM classifier, since a GMM classifier can be interpreted as an HMM with only a single state. The findings of Matsui et al. were further supported by Zhu et al. [61], who found that HMM-based speaker recognition performance was highly correlated with the total number of Gaussian mixtures in the model. This means that the total number of Gaussian mixtures, and not the state transitions, is important for text-independent speaker recognition.
ANN techniques have numerous architectures, and a variety of forms have been used in the speaker recognition task [69]. These forms include Multi-Layer Perceptron (MLP) networks, Radial Basis Function (RBF) networks [73], Gamma networks [60], and Time-Delay Neural Networks (TDNN) [70]. Fredrickson [71] and Finan [72] conducted separate studies comparing the classification performance of RBF and MLP networks. In both studies, the RBF networks were found to be superior. The RBF network was found to be more robust in the presence of imperfect training conditions due to its more rigid form. In other words, the RBF network was found to be less susceptible to overtraining than the MLP network. It was shown that some neural network configurations can provide results comparable with the GMM [74].
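The description of the GMM as a weighted sum of multivariate Gaussian PDFs can be illustrated with a short scoring routine. The sketch below is a minimal example under assumptions that are not taken from the paper: the covariance matrices are diagonal, the model parameters are already trained, and the score of an utterance is the average per-frame log-likelihood. The function name is hypothetical.

```python
import numpy as np


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors under a GMM,
    p(x) = sum_i w_i * N(x; mu_i, diag(var_i)).
    frames: (T, D); weights: (M,); means, variances: (M, D).
    Diagonal covariances are an assumption made for simplicity."""
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.atleast_2d(np.asarray(means, dtype=float))
    variances = np.atleast_2d(np.asarray(variances, dtype=float))
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)  # (M,)
    log_comp = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_weighted = np.log(weights)[None, :] + log_comp             # (T, M)
    # Log-sum-exp over the mixture components, computed stably.
    peak = log_weighted.max(axis=1, keepdims=True)
    log_px = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_px.mean())
```

In identification, the utterance would be scored against every enrolled speaker's GMM and the highest-scoring model selected; with a single mixture component and one state, the expression reduces to a single Gaussian density, which makes the link to a one-state continuous HMM explicit.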
The above literature survey indicates that the GMM is the best-performing classifier for the speaker recognition task. For that reason, a number of the most recent studies have focused on improvements of the classical GMM algorithm [67,75,76]. Any direct comparison of conventional speaker recognition architectures is difficult due to variation in the training and testing conditions, the computational complexity of classifiers and feature extraction methods, and the types of speech data. The quality and number of speech samples used in training and testing can have a significant impact on the performance of speaker recognition systems.
Classifier fusion strategies have also been used to obtain improved recognition results. Fusion at the classifier level combines the match scores to obtain the final decision [77]. The feature extraction strategy is the same for the multiple classifiers. Thus the fusion can appear in one of two forms: by combining the features at the frame level into a vector for which a single model is trained, or by modeling each feature set using a separate classifier.
Performance evaluation methods
The error rates of speaker recognition systems were initially measured using receiver operating characteristic (ROC) curves [78]. However, in more recent studies of speaker recognition systems, the nonlinear ROC curves have been replaced by Detection Error Trade-off (DET) plots [79], which are believed to provide a more efficient representation of the system performance because of their linear behavior in the logarithmic coordinate system.
Speaker recognition systems, like any other recognition system, require a decision logic methodology to evaluate the system performance. During the evaluation phase, the test feature vectors are matched against the reference models of the training database. This matching process yields a match score. There are two types of possible errors in speaker verification: the false acceptance error, also known as the false alarm probability, and the false rejection error [19,78,80], also known as the miss probability. A false acceptance (or false alarm) error occurs when the system accepts a claim of identity from an impostor. A false rejection (or miss) error occurs when the system rejects a legitimate speaker as an impostor. The equal error rate (EER) is the principal value used to judge the system performance; it is the error rate at which the false alarm probability equals the miss probability. The DET plots are related to the EER parameter, which represents a normalized measure of the system error rates. An example of a DET plot is shown in Fig. 6. As illustrated in Fig. 6, the EER can be determined graphically as the percentage of false rejection (or false acceptance) at the intersection point between a 45° line and the DET curve. The smaller the EER for a given speaker verification system, the better the system performance.
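The error definitions above translate directly into a numerical estimate of the EER from two sets of match scores. The sketch below is illustrative: it assumes that higher scores indicate a better match, that a claim is accepted when its score is at or above the threshold, and it approximates the EER by sweeping the threshold over the observed scores and taking the point where the two error rates are closest. The function names are hypothetical.

```python
import numpy as np


def error_rates(genuine_scores, impostor_scores, threshold):
    """False acceptance (false alarm) and false rejection (miss) rates
    at a given threshold; a claim is accepted when score >= threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    false_accept = float(np.mean(impostor >= threshold))
    false_reject = float(np.mean(genuine < threshold))
    return false_accept, false_reject


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold
    and keeping the one where the two error rates are closest.
    Returns (eer, threshold)."""
    candidates = np.sort(np.concatenate([
        np.asarray(genuine_scores, dtype=float),
        np.asarray(impostor_scores, dtype=float),
    ]))
    best = None
    for threshold in candidates:
        far, frr = error_rates(genuine_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, float(threshold))
    return best[1], best[2]
```

Plotting the false rejection rate against the false acceptance rate over the same threshold sweep gives the curve from which the EER is read off graphically.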
Speech corpora for speaker recognition research
The selection of a suitable speech corpus is of fundamental importance in testing the performance of newly developed speaker recognition techniques. Ideally, the database used for performance evaluation should reflect the environmental characteristics determined by the potential applications. Practical speaker recognition systems are typically used in non-ideal environments, including acoustic noise and telephone line band limitations. In addition, most applications involve recognizing an individual at a later date than the date of the provided speech sample; hence reliability over a long period of time is important.
Fig. 6 DET (detection error trade-off) plot.
Looking at the possible commercial applications of a speaker recognition system, and in particular telephone-based speaker recognition, the following key requirements for speech corpora can be identified:
Speech recorded over a telephone line [81] with the speaker in a natural environment;
The duration of a single recording session should be at least 60 seconds;
The data should be recorded for each speaker during a number of sessions spaced in time and covering a significant time interval (at least 1 year);
The corpus should contain speech samples from a sufficiently large number of speakers;
The corpus should contain speakers using the same language;
The recording conditions should be well documented and the speech samples correctly labeled to avoid misuse of data.
Publicly available data comes from various commercial and academic sources and has been produced for a wide variety of applications and developed under different conditions. Although in recent years the NIST database [89] has become the most frequently used corpus, other data sets are still being used, as they can provide performance evaluation across different recording environments, populations of speakers and languages. NIST launches an evaluation plan every year and, as a result, provides a speech corpus on which to perform the experiments. Apart from NIST, a wide range of speech corpora is available, such as TIMIT [82], SIVA [83], POLYCOST [84], YOHO [85], POLYVAR [86], KING [87], and the Switchboard speech corpus [88].
In this paper we defined the speaker recognition task, described its possible applications, and summarized the conventional methods of speaker recognition. A general framework of the speaker recognition methodology, comprising the training and testing stages, was presented. Conventional methods used at each stage of the speaker recognition process were explained. The stages include pre-processing, feature extraction, speaker modeling, classification, decision making, and methods of evaluating speaker recognition performance. A brief review of the speech corpora frequently used in speaker recognition research was also given. Current state-of-the-art speaker recognition systems use segmental analysis for speech analysis, MFCC and its variants as feature extraction methods, the GMM as the modeling technique, and DET plots to evaluate the system performance. The NIST SRE corpora are widely used to evaluate the proposed methods and advancements in the field.