Advanced Distributed Systems and CSV Format

Abstract

The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Surprisingly, although the format is very common, no general standard specification of CSV exists, which allows for a wide variety of interpretations of CSV files [6]. Because of its lack of complexity, CSV has been widely adopted and has become popular. It is supported by the major spreadsheet applications (e.g. Excel, OpenOffice), all of which can import and/or export CSV files. Many database applications, such as MySQL, Oracle and Microsoft Access, also support CSV, allowing data in CSV files to be loaded into a database or vice versa [10]. Many programming languages support CSV: C++, Java and Ruby have various libraries for writing, reading and manipulating CSV files [7]. There are also data conversion tools for converting CSV files to other formats such as XML, XLS, PDF and DOC.

Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like line.split(",") is eventually bound to fail.
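The failure mode mentioned above can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, and the sample record is modelled on the quoted-field example used later in this document.

```java
import java.util.Arrays;

public class NaiveSplitDemo {
    // Naive approach: treat every comma as a field separator.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // One record with three fields; the third field contains a comma
        // and is therefore enclosed in double quotes.
        String line = "1997,hammad,\"Msc, hm136\"";
        String[] fields = naiveSplit(line);
        // The quoted field is cut in two, so four pieces are reported.
        System.out.println(fields.length + " pieces: " + Arrays.toString(fields));
    }
}
```

The split ignores the quotes, so the single quoted field is returned as two pieces that still carry stray quote characters.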

This project provides a post-hoc specification of CSV, based on an understanding of existing implementations of CSV and of related programs such as parsers. The specification can take many forms, e.g. it could be a BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form) grammar, both of which are used to express context-free grammars. To generate a Java parser from the formal specification, an ANTLR grammar will be used in this project. The other part of the project is to implement a Java program which converts a CSV file to an XML file. The program will take a CSV input (e.g. input.csv) and convert it into XML (e.g. output.xml) using an XML library.

Acknowledgements

"I love deadlines. I love the whooshing sound they make as they fly by."

Douglas Adams, 1952-2001

I wish to acknowledge the following people for their role in the completion of this thesis:

My project supervisor, Dr Tom Ridge, for his guidance, constructive proofreading, and many fruitful discussions. Thank you for helping me to produce a thesis I can be proud of.

My parents, for always supporting my choices in life, and only occasionally nagging me to finish up and get a real job that pays money.

The whole department and staff of Computer Science at the University of Leicester.

Chapter 1: Introduction

The CSV ("comma separated values") file format is often used to exchange data between disparate applications. For example, a CSV file might be used to transfer information from a database program to a spreadsheet. The CSV file format is used by the OpenOffice Calc and Microsoft Excel spreadsheet applications. Many other applications support CSV in some way, to import or export data. Surprisingly, although the format is very common, no general standard specification of CSV exists, which allows for a wide variety of interpretations of CSV files [6].

The format is sometimes also called comma delimited. A CSV file is a specially formatted plain text file which stores spreadsheet or basic database information in a simple format. Each record of the table is located on a separate line, delimited by a line break (CRLF), and each field within a record is separated from the next field value by a comma [2]. For example:

aaa,bbb,ccc CRLF

xxx,yyy,zzz CRLF

Because of its lack of complexity, CSV has been widely adopted and has become popular. It is supported by the major spreadsheet applications (e.g. Excel, OpenOffice), all of which can import and/or export CSV files. Many database applications, such as MySQL, Oracle and Microsoft Access, also support CSV, allowing data in CSV files to be loaded into a database or vice versa [10]. Many programming languages support CSV: C++, Java and Ruby have various libraries for writing, reading and manipulating CSV files [7]. There are also data conversion tools for converting CSV files to other formats such as XML, XLS, PDF and DOC.

In 2007, the number of search hits on Google was highest for the CSV format. This indicates that there is still a lot of CSV data on the Web [7].

CSV is arguably the most important (non-XML) file format for structured data. For example, existing databases always support import and export of data in CSV format, but rarely in other formats such as Excel (.xls) files.

Motivation:

The main motivation for choosing this project is to gain a good understanding of the formal specification of CSV and of the need for such a specification. The project will provide a formal specification of CSV that other people can use as a reference when building implementations. Hopefully this will lead to fewer problems.

Project Aim:

The aim of this project is to provide a post-hoc specification of CSV, based on an understanding of existing implementations of CSV and of related programs such as parsers. The specification can take many forms, e.g. it could be a BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form) grammar, both of which are used to express context-free grammars, or a functional parser implementation, or any mathematical text.

Project Objectives/Requirements:

Essential:

The essential objectives for the project are:

Description of 3 existing CSV implementations

Analysis of differences between implementations

Identification of main format choices

Understanding of basic CSV specification

Implementation of a parser for the basic CSV specification

Extensions to include "," (requires quoting of entries)

Recommended:

The recommended objectives for the project are:

Formal BNF grammar

Extension to handle separators other than the comma

Different record terminators

Extensions to include quoted values (where the separator, e.g. ',', appears in the value itself)

Optional:

The optional objectives are:

Implement a parser for other languages (C#, C)

Pretty printer for other languages (C#, C)

CSV to XML translators

Personal challenges:

At the beginning of this project, I had limited knowledge of comma separated values, their existing implementations and formal specifications. An extensive literature review of the comma separated values format was required. In order to generate a parser for comma separated values, new skills and tools had to be learnt. I first learnt how to write a formal grammar using BNF and Extended BNF. Afterwards I learnt ANTLR and used an ANTLR grammar to generate a Java parser for the basic CSV specification.

Project challenges:

The main challenges presented by the project are to understand the existing implementations and specifications of comma separated values and to identify the differences between them.

To understand formal specifications of comma separated values, e.g. BNF grammar, functional parser.

The other challenge is to implement the CSV specification and also to implement supporting programs in programming languages such as Java and C#.

Application and platform Support:

CSV is a simple data exchange format that is well supported in most spreadsheet and database applications [8]. Popular spreadsheet applications are:

Microsoft Excel

OpenOffice.org Calc

Gnumeric

Many database applications, such as MySQL, Oracle and Microsoft Access, also support CSV [10].

Supporting Programming Languages:

Many programming languages support CSV. C++ and Java have various libraries for writing, reading and manipulating CSV files [7]. There are also data conversion tools for converting CSV files to other formats such as XML, XLS, PDF and DOC.

The following table lists programming language support for the comma separated values format [8].

Language        Tools

Java            Several free CSV tools exist: SuperCSV, OpenCSV, Flatpack, JavaCSV, QN CSV, CSVReader/Writer, CSVFile, ExcelCSV, CsvReader

C/C++           Free tools: CSV module, bcsv, CSV reading and manipulation. Libraries: libcsv (LGPL), libcsv_parser++ (New BSD License)

PHP             fgetcsv() function, fputcsv() function, or parseCSV

.Net            FileHelpers, an automatic file import/export framework by Marcos Meli (LGPL); Flat File Checker, a data validation tool that supports CSV files; Fast CSV Reader by Sebastien Lorion, an open source class (MIT license); CSV Reader

MATLAB          csvread, dlmread

Visual Basic    ParseCSV

Table 1.1. Programming languages that support the CSV format [8]

Chapter 2: Literature Survey

[1] RFC 4180

[2] Repici

[3] Excel

[4] CSV Reader

[5] Super CSV

CSV Specification:

With no standard format specification and a vast variety of implementations available, processing CSV files can be a challenge, since there is no master, authoritative formal specification of CSV. [RFC] is an informational IETF document that describes the format for the "text/csv" MIME type registered with the IANA. [Repici] gives a detailed account of the CSV format and its variations, many of which are undocumented and not intended to be used outside their own applications. Although there is no formal authoritative specification of the CSV file format, [RFC] and [Repici] both attempt to specify a common definition of it. They are the most complete documents available describing the CSV file format. The rules described in [Repici] are mostly compatible with [RFC], except that [7]:

It allows the use of line feed (LF), or carriage return and line feed pairs (CRLF), as line breaks.

It does not require the same number of columns across the entire file.

Leading and trailing spaces in the fields are ignored.

[CSVReader] is a variation of the CSV file format: it uses single quotes (apostrophes) as escape characters instead of double quotes.

[RFC] is an attempt to specify common rules for the CSV file format. Although [RFC] is only informational and is not yet a standards-track specification, it is an official RFC document reviewed and approved by the Internet Engineering Steering Group (IESG). [RFC] appears to be the most suitable document to use for processing the format at this moment.

Implementations:

A CSV file uses commas to separate values, although some implementations of CSV import/export tools use other separators. A simple implementation of CSV will not allow commas or other special characters (e.g. new lines) inside a field value, whereas more refined implementations permit commas and other special characters in field values. Many CSV implementations put double quote characters around values that contain special characters, i.e. commas, double quotes or newlines [Excel]. Some CSV implementations may use an escape character such as a backslash to encode reserved characters as an escape sequence [Repici].

The CSV format was originally developed by Microsoft for [Excel], and the implementation in [Excel] in particular is the de facto standard for CSV. The simplicity of the [Excel] format makes it one of the most supported formats among all spreadsheet and database software.

[CSVReader] and [SuperCSV] each provide a standard CSV implementation. [CSVReader] shows how to read, write and bulk insert common file formats such as CSV, XLS and XML from C# and VB.Net. [SuperCSV] is a programmer-friendly, free CSV package for Java. Its unique features raise the bar and set a new standard for CSV packages. [SuperCSV] can easily convert input and output to integers, dates, big decimals, trimmed strings, etc.

The underlying implementation of [SuperCSV] has been written in an extensible way, so new readers/writers and cell processors can easily be supported.

Problems with lack of specification:

Because there is no single specification, there are considerable differences among implementations. This can lead to interoperability difficulties and has resulted in endless problems at all levels of IT, e.g. errors, bugs and program failures. The format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer [25].

Implementation Problems:

Different implementations of CSV have different problems, e.g. [SuperCSV] has a problem with the management of the data set. [Excel] has a problem with leading zeros and spaces: it always removes leading zeros from fields before displaying them, and it always removes leading spaces.

For Example:

00300,03456,No_action,06

Some implementations, e.g. [Excel], may chop the leading zeros from these fields even if they are quoted.
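The effect can be illustrated with a small Java sketch. It assumes, purely for illustration, that the consumer converts a field to a number before displaying it; it is not a description of how Excel works internally, and the class and method names are hypothetical.

```java
public class LeadingZeroDemo {
    // A consumer that interprets the field as a number loses leading zeros.
    static String viaInteger(String field) {
        return String.valueOf(Integer.parseInt(field));
    }

    // A consumer that keeps the field as text preserves it exactly.
    static String viaText(String field) {
        return field;
    }

    public static void main(String[] args) {
        String field = "00300";
        System.out.println(viaInteger(field)); // numeric round trip
        System.out.println(viaText(field));    // text is unchanged
    }
}
```

Keeping field values as strings until the application knows the intended type avoids this kind of silent change.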

Chapter 3: Background

A file format is a particular way in which information is encoded for storage in a computer file. There are different formats for different kinds of information. Specifically, files encoded using the CSV format are used to store tabular data. The CSV format is widely used to pass data between computers with different internal word sizes and data formatting needs. For this reason, CSV files are frequently used on all computer platforms. In computer science terms, a CSV file is called a "flat file" [10]. A flat file is a plain text or mixed text and binary file which usually contains one record per line or 'physical' record (for example on disc or tape). Within such a record, the single fields can be separated by delimiters, e.g. commas, or have a fixed length [11].

A CSV file uses commas to separate values, although some implementations of CSV import/export tools use other separators. A simple implementation of CSV will not allow commas or other special characters (e.g. new lines) inside a field value, whereas more refined implementations permit commas and other special characters in field values. Many CSV implementations put double quote characters around values that contain special characters, i.e. commas, double quotes or newlines [10].

Some CSV implementations may use an escape character such as a backslash to encode reserved characters as an escape sequence [3].

For parsing CSV, a BNF grammar will be used. BNF (Backus-Naur Form) is used to express context-free grammars. It is widely used as a notation for the grammars of computer programming languages, instruction sets and communication protocols, as well as a notation for representing parts of natural language grammars [12].

In information technology, a parser is a program, usually part of a compiler, which checks for correct syntax and builds a data structure (often some kind of parse tree, abstract syntax tree or other hierarchical structure) implicit in the input tokens [13].

The following are a few well-known tools for generating parsers [14].

ANTLR (generates a recursive descent parser in C, C++ or Java)

Java Compiler Compiler (JavaCC)

Coco/R (lexer and parser generator)

Byacc/J (parser generator that can generate Java source code)

LEMON (LALR(1) parser generator)

GOLD Parser (generates parsers that use a Deterministic Finite Automaton (DFA))

YaYacc (generates C++ parsers using an LALR(1) algorithm)

CUP (parser generator for Java)

JFlex (lexer generator for Java)

BNF Grammar:

The specification of CSV can take many forms; one of them is a BNF (Backus-Naur Form) grammar that defines the syntax of the language. BNF is a metasyntax used to express context-free grammars. It is a formal way to describe formal languages, i.e. sets of finite strings of letters, symbols or tokens. John Backus and Peter Naur developed the notation as a context-free grammar to define the syntax of a programming language. BNF is widely used as a notation for the grammars of computer programming languages, instruction sets and communication protocols, as well as a notation for representing parts of natural language grammars [12].

BNF is a formal mathematical way to describe the syntax of programming languages. It consists of:

A set of terminal symbols

A set of non-terminal symbols

A set of production rules of the form

Left-Hand-Side ::= Right-Hand-Side

where the LHS is a non-terminal symbol and the RHS is a sequence of symbols (terminal or non-terminal). A production rule states that the non-terminal on the left-hand side may be replaced by the expression on the right-hand side [18].

As an example, consider the following BNF for a Leicester postal address:

<address> ::= <name> <street> <postcode>

<name> ::= <initial> "." <firstname> <lastname> <EOL> | <initial> "." <lastname> <EOL>

<street> ::= <housename> <streetname> <EOL>

<postcode> ::= <townname> "," <zipcode> <EOL>

This translates into English as:

An address consists of a name, followed by a street, followed by a postcode.

A name consists of either an initial followed by a dot, followed by a first name, followed by a last name and an end of line, or an initial followed by a dot, followed by a last name and an end of line.

A street part consists of a house name, followed by a street name and an end of line.

A postcode part consists of a town name followed by a comma, followed by a zip code and an end of line.

The following table of conventions describes BNF notation and can help when reading the grammar rules [19]:

Symbol          Description

<word>          A word in angle brackets. The word is not meant to be used literally in a statement; its rules are defined elsewhere.

<word> ::=      A word in angle brackets followed by ::=. A definition, or BNF "production". The symbol ::= can be read as "is defined as".

|               The pipe or "OR" symbol. It precedes alternatives and can be read as "or".

WORD            Text in all caps. A query-grammar keyword, to be typed literally.

Table 2: Standard BNF Notations

There are many extensions and variants of BNF, including Extended and Augmented Backus-Naur Forms (EBNF and ABNF) [1]. For generating the CSV parser, BNF and EBNF will be used.

EBNF Grammar:

There are many different versions and extensions of BNF, generally introduced either for the sake of simplicity or to adapt the notation to a specific application. Extended Backus-Naur Form (EBNF) is one of the most common of these variants [12]. Most programming language standards use EBNF to define the grammar of the language. It is the most refined form of BNF.

Extended Backus-Naur Form is a family of metasyntax notations used to express context-free grammars, that is, a formal way to describe computer programming languages and other formal languages [20].

BNF had the problem that options and repetitions could not be expressed directly. Instead, they needed an intermediate rule or alternative production, defined to be either nothing or the optional production for an option, or either the repeated production or itself, recursively, for a repetition. The same constructs can still be used in EBNF [20].

For Example:

In EBNF

signed number = [ sign ] , number ;

can be defined in BNF style as:

<signed number> ::= <sign> <number> | <number>

The basic symbols and notations of EBNF are the same as those of BNF, with the addition of the following rules:

Symbol                  Description

{ word }                Curly brackets enclosing a word or item. A repeated element.

[ word ] or [ WORD ]    Square brackets enclosing a word or item. An optional element.

( word )                Parentheses. They keep the enclosed symbols (terminals and non-terminals) together.

[ , <word> … ]          A comma, a word, and an ellipsis, all enclosed in square brackets. You can optionally append a comma-separated list of one or more <words>.

;                       Semicolon. Marks the end of a rule.

,                       Comma. Concatenation.

" … "                   Terminal strings.

Table 3: Standard EBNF Notations

ANTLR Parser Generator:

ANTLR (ANother Tool for Language Recognition) is a parser generator written in Java. It provides a framework for constructing recognizers, interpreters, compilers and translators. ANTLR takes as input a grammar that specifies a language and generates as output the source code of a recognizer for that language. At the moment, ANTLR supports generating code in C, Java, Python, C# and Objective-C. A language is specified using a context-free grammar expressed in Extended Backus-Naur Form (EBNF) [22].

ANTLR allows the generation of parsers, lexers, tree parsers and combined lexer-parsers. Parsers can automatically generate abstract syntax trees, which can be further processed with tree parsers. ANTLR provides a single consistent notation for specifying lexers, parsers and tree parsers. This is in contrast with other parser/lexer generators and adds greatly to the tool's ease of use [21].

For Example:

utterance: greeting | exclamation ;

greeting: interjection subject ;

exclamation: 'Hooray!' ;

interjection: 'Hello' ;

subject: 'World!' ;

It will produce the following syntax diagram:

Figure 1. Syntax Diagram of grammar in ANTLR

This grammar matches the inputs "Hello World!" and "Hooray!".

By default, ANTLR takes a grammar and generates a recognizer for it, that is, a program that takes an input stream and reports an error if the input stream does not conform to the syntax specified by the grammar. If there are no syntax errors, the default action is simply to exit without printing any message. In order to do something useful with the language, actions can be attached to grammar elements in the grammar. These actions are written in the programming language in which the recognizer is being generated. When the recognizer is generated, the actions are embedded in its source code at the appropriate points. Actions can be used to build and check symbol tables and, in the case of a compiler, to emit instructions in a target language [22].

ANTLR has the advantage of accepting EBNF-style notation, which is why a basic understanding of EBNF and its notations is needed in order to write a parser grammar. The basic differences between ANTLR and EBNF notation are given in the table below.

Meaning               EBNF       ANTLR

'a' zero or once      [ a ]      a?

'a' zero or more      { a }      a*

'a' once or more      a { a }    a+

Table 4: Notation Difference between EBNF and ANTLR

For example, in EBNF:

[ { header entry } newline ] newline

is equivalent in ANTLR to:

( ( header entry )* newline )? newline

Difference between Lexer and Parser rules in ANTLR:

EBNF does not distinguish between lexer and parser, only between terminal and non-terminal symbols. Terminal symbols describe the input, while non-terminal symbols describe the tree structure behind the input. In practice, terminal symbols are recognized by a lexer and non-terminal symbols are recognized by a parser. ANTLR imposes the convention that lexer rules start with an uppercase letter and parser rules with a lowercase letter. Lexer rules contain only literals (combined using EBNF symbols; literals can be both single characters and longer strings) or references to other lexer rules. Parser rules may reference parser and lexer rules as they wish and may even include literals, but never only literals. Consider the following sample grammar [23]:

For Example:

decl: 'int' ID '=' INT ';' // E.g., "int x = 3;"

| 'int' ID ';' // E.g., "int x;"

;

ID: ('a'..'z' | 'A'..'Z')+ ;

INT: '0'..'9'+ ;

Expressions and Operators in ANTLR:

Operators:

The operators, in order of precedence, are [24]:

Category                    Operators

Boolean negation            Not (~)

Unary adding operators      +/-

Multiplying operators       */Mod

Binary adding operators     +/-

Relational operators        = /= < <= > >=

Logical operators           And, Or (& / |)

Table 5: Operators used in ANTLR

Expressions:

For binary operators, both operands must be of the same type. Similarly, for assignment compatibility, both the left and right sides must have the same type. Objects are considered to have the same type only if they both have the same type name. Therefore, two distinct type definitions are considered different even if they are structurally identical. This is known as "name equality" of types.

Chapter 4: Specification and Design

Basic Specification of CSV:

There are various specifications for the CSV format [3], [4], [5], but no formal specification exists. Variations between CSV implementations in different programs are quite common and can lead to interoperability difficulties at all levels of IT, e.g. errors, bugs and program failures. Most implementations comply with the following rules [6], [7].

1. Each column/field is separated by a comma character.

For Example:

Name,class,subject CRLF

2. Each row/record is separated by a line break: CRLF (ASCII 13 Dec or 0D Hex and ASCII 10 Dec or 0A Hex respectively) for Windows, LF for Unix and CR for Mac. The last row need not end with a line break [9].

For Example:

1997,hammad,Msc CRLF

1986,Masood,Msc

3. Fields with embedded commas must be surrounded by double quote characters.

For Example:

1997,hammad,"Msc, hm136" CRLF

4. Fields with embedded line breaks (CRLF) must be surrounded by double quote characters.

For Example:

1997,hammad,"I am CRLF

Computer student" CRLF

5. Spaces are considered part of the field data and should not be ignored.

For Example:

1997,hammad,Msc

is not the same as

1997, hammad, Msc

6. Double quotes embedded in a column must be preceded (escaped) by another double quote.

For Example:

"1997",hammad,"Msc""hm136""CS"

7. Each row must contain the same number of columns across the entire file. There should be no trailing comma after the last column of each row.

For Example:

Hammad,hm136,Master of Science CRLF

8. An optional header representing column names may appear as the first row. The optional header must follow the same rules as described above.

For Example:

Field_name,field_name,field_name CRLF

1997,hammad,Msc CRLF
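The quoting rules above (fields containing commas, line breaks or double quotes) can be sketched from the writer's side in Java. This is a minimal sketch under the rules listed in this section, not the implementation used in the project; the class name CsvWriterSketch and its methods are hypothetical.

```java
import java.util.List;

public class CsvWriterSketch {
    // Quote a field only when it contains a comma, a double quote or a
    // line break; embedded double quotes are escaped by doubling them.
    static String escapeField(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\r") || field.contains("\n");
        if (!needsQuotes) {
            return field; // spaces are kept: they are part of the data
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the fields with commas (no trailing comma) and end with CRLF.
    static String formatRecord(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escapeField(fields.get(i)));
        }
        return sb.append("\r\n").toString();
    }
}
```

For example, formatRecord applied to the fields 1997, hammad and "Msc, hm136" yields the record shown under rule 3, with only the third field quoted.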

Generating a Grammar in ANTLR using the Basic Specification

In order to generate a parser in ANTLR, we first have to write a BNF and EBNF grammar for the basic CSV specification.

Our main goals for the BNF and EBNF grammar are as follows:

Recognize fields delimited by commas

Recognize records delimited by line breaks

Recognize quoted fields

Be able to parse quotes, commas and line breaks inside quotes

Ensure that all records have the same number of fields
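The last goal is usually easier to check after parsing than inside the grammar itself. The following Java sketch assumes the records have already been parsed into lists of strings; the class and method names are hypothetical.

```java
import java.util.List;

public class FieldCountCheck {
    // Returns the index of the first record whose field count differs
    // from that of the first record, or -1 if all records agree.
    static int firstInconsistentRecord(List<List<String>> records) {
        if (records.isEmpty()) {
            return -1;
        }
        int expected = records.get(0).size();
        for (int i = 1; i < records.size(); i++) {
            if (records.get(i).size() != expected) {
                return i;
            }
        }
        return -1;
    }
}
```

Returning the index rather than a boolean lets the caller report which record breaks the rule.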

State Diagram for Parser [24]

Field Start:

The state at the start of every field. White space should be kept for unquoted fields, but any white space before a quoted field is discarded.

Normal:

Same as an unquoted field.

Quoted:

Any character inside a quoted field.

Post Quoted:

The state after a quoted field. White space may appear between a quoted field and the next field or record, and should be discarded.
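The states described above can be sketched as a small hand-written state machine in Java. This is an illustrative sketch only, not the ANTLR-generated parser of this project: the class name, the state names and the handling of edge cases (an empty line becomes a record with one empty field, an unterminated quote is closed at the end of input) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvStateMachine {
    enum State { FIELD_START, UNQUOTED, QUOTED, POST_QUOTED }

    static List<List<String>> parse(String input) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        State state = State.FIELD_START;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean lineBreak = (c == '\r' || c == '\n');
            switch (state) {
                case FIELD_START:
                    if (c == ' ' || c == '\t') {
                        field.append(c); // kept unless a quote follows
                        break;
                    }
                    if (c == '"') {
                        field.setLength(0); // drop white space before the quote
                        state = State.QUOTED;
                        break;
                    }
                    state = State.UNQUOTED;
                    // fall through: handle c as part of an unquoted field
                case UNQUOTED:
                case POST_QUOTED:
                    if (c == ',' || lineBreak) {
                        record.add(field.toString());
                        field.setLength(0);
                        if (lineBreak) {
                            // Treat CRLF as a single line break.
                            if (c == '\r' && i + 1 < input.length()
                                    && input.charAt(i + 1) == '\n') {
                                i++;
                            }
                            records.add(record);
                            record = new ArrayList<>();
                        }
                        state = State.FIELD_START;
                    } else if (state == State.UNQUOTED) {
                        field.append(c);
                    } // text after a closing quote is discarded
                    break;
                case QUOTED:
                    if (c != '"') {
                        field.append(c); // commas and line breaks are data here
                    } else if (i + 1 < input.length() && input.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        state = State.POST_QUOTED;
                    }
                    break;
            }
        }
        // Flush the last record when the input does not end with a line break.
        if (state != State.FIELD_START || field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Each record is returned as a list of field values with the surrounding quotes removed, so a later step can check field counts or convert the data.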

Rule Dependency Graph for Parser generated in ANTLR

The rule dependency graph for the grammar written in ANTLR is produced after parsing.

In this project, we implemented two rules in the ANTLR grammar in order to generate the parser, namely line and field. Inside the line rule there will be NEWLINE, COMMA or field. There can be many fields in one line.

The other rule is field, which can be QUOTED or UNQUOTED. Normally in ANTLR we write parser rule names in lowercase letters.

Example in ANTLR

Let us try to run CSV data that follows the basic specification in the ANTLR interpreter environment, using the rules described in the grammar we made.

For Example

hammad,masood,"hammad,masood","computer CRLF

science",good work

This will produce the corresponding syntax diagram.

The input is processed successfully. We can see that both rules are applied: there can be many fields in one line, and all of them are separated by the comma delimiter.

ANTLR Parser Java Code:

ANTLR will generate the parser code from the grammar written in BNF/EBNF-style syntax. An excerpt of the generated Java parser looks like this:

import org.antlr.runtime.*;
import java.util.Stack;
import java.util.List;
import java.util.ArrayList;

public class superParser extends Parser {
    public static final String[] tokenNames = new String[] {
        "<invalid>", "<EOR>", "<DOWN>", "<UP>", "'\\n'", "'!'", "'a'", "'z'", "'A'", "'Z'", "','"
    };

    ...

        }
        catch (RecognitionException re) {
            reportError(re);
            recover(input, re);
        }
        finally {
        }
        return ;
    }
    // $ANTLR end "line"

    // Delegated rules

Design for the whole Project:

[Figure: block diagram with the components Informal Grammar, BNF/EBNF, Input CSV, ANTLR Parser Generator, Java Implementation, C# Implementation, XML Library, CSV to XML Converter and Output XML]

Figure 1. Design of core functionality of the project
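The CSV to XML converter stage of the design can be sketched with the XML library that ships with Java (the DOM and Transformer APIs in javax.xml). This is a sketch under assumptions: it takes records that have already been parsed into lists of strings, and the element names records, record and field are hypothetical, since the project does not fix an output schema here.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CsvToXmlSketch {
    // Builds <records><record><field>...</field></record></records>
    // from parsed CSV records and serializes it to a string.
    static String toXml(List<List<String>> records) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("records");
            doc.appendChild(root);
            for (List<String> record : records) {
                Element rec = doc.createElement("record");
                root.appendChild(rec);
                for (String value : record) {
                    Element field = doc.createElement("field");
                    // The library escapes characters such as & and <.
                    field.setTextContent(value);
                    rec.appendChild(field);
                }
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Configuration or serialization failure; wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

Building the document through the library rather than by string concatenation leaves the escaping of special characters in field values to the XML serializer.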

Test Suite:

I will be using JUnit as my test framework. I will write JUnit tests to exercise the main functionality of the parser implementations and of the CSV to XML converter.

For example:

Sample CSV files are provided with their expected XML output; a JUnit test will run the parser on each sample CSV file and compare the result against the expected XML output.

I will also write tests to check specifically for:

Trial Name

Description

Pass / Fail

Comma Between values

William claude dukenfields are separated with comma

N/A

New line

Entries are separated by newline Character.

N/A

CRLF

Entries can besides be separated by CRLF.

N/A

Single Word

For back uping individual field

N/A

Multiple Wordss

For back uping multiple Fieldss

N/A

Quoted strings

William claude dukenfields that contains particular characters ( newline, return, double-quote, comma, infinite ) is surrounded by dual quotation marks.

N/A

Quotation mark Escaping

A Using A ” ” A to representA ” A

N/A

Space Removal

Remove infinites around commas

N/A

Table 6. Table of different instances for CSV format
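To make the quoting cases in Table 6 concrete, the following is a minimal sketch of a single-record parser that covers quoted strings, quote escaping and space removal. It is an illustration under my own assumptions, not the ANTLR-generated parser: the names QuotedRecordSketch and parseRecord are hypothetical, spaces are trimmed only around unquoted fields, and malformed input is handled leniently apart from an unterminated quote.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a parser for one CSV record with optional double-quoted fields. */
final class QuotedRecordSketch {

    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;  // currently inside a double-quoted field
        boolean wasQuoted = false; // the current field was written in quotes
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c); // commas, spaces and newlines are literal here
                } else if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    field.append('"'); // "" inside quotes stands for one literal quote
                    i++;
                } else {
                    inQuotes = false; // closing quote
                }
            } else if (c == '"' && field.toString().trim().isEmpty()) {
                field.setLength(0); // drop spaces in front of the opening quote
                inQuotes = true;
                wasQuoted = true;
            } else if (c == ',') {
                fields.add(wasQuoted ? field.toString() : field.toString().trim());
                field.setLength(0);
                wasQuoted = false;
            } else if (!(wasQuoted && c == ' ')) {
                field.append(c); // spaces after a closing quote are skipped
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        fields.add(wasQuoted ? field.toString() : field.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        // three fields: a / b, "c" / d
        System.out.println(parseRecord("a , \"b, \"\"c\"\"\" ,d"));
    }
}
```

Each row of the table maps naturally to one JUnit test that calls such a method with a small input and compares the returned list with the expected fields.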

Testing against implementations:

I will take large files from Excel and similar applications and parse them. I will also transform these files, e.g. by inserting commas into values, then save the result as a CSV file and attempt to import it back into Excel and the other applications.
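For the transformation step, a value that has had a comma inserted must be quoted when it is written back, otherwise the import would see an extra field. The sketch below shows one way this could be done. The names CsvWriterSketch, escapeField and toRecord are hypothetical, and the quoting rule (quote a field when it contains a comma, a double quote, a line break, or leading or trailing spaces) is my assumption rather than the behaviour of any particular application.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of writing one CSV record so that it can be imported again unchanged. */
final class CsvWriterSketch {

    /** Quotes a field only when it contains characters that would otherwise be misread. */
    static String escapeField(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")
                || value.startsWith(" ") || value.endsWith(" ");
        if (!needsQuotes) {
            return value;
        }
        // double every embedded quote, then wrap the whole field in quotes
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped fields with the comma delimiter. */
    static String toRecord(List<String> fields) {
        return fields.stream()
                .map(CsvWriterSketch::escapeField)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toRecord(Arrays.asList("plain", "a,b", "say \"hi\"")));
    }
}
```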

Chapter 5: Conclusion

The comma-separated values file format was originally developed by Microsoft for its Excel product [ ]. It is a legacy file format that has long been used as an exchange format among various applications. Its simplicity makes it one of the most widely supported formats among spreadsheet and database software.

Unfortunately, Microsoft has never openly published a specification for the comma-separated values file format. This has led to different implementations of the CSV file format. For example, some implementations use only a carriage return (CR) as the line terminator, some use a line feed (LF), and others use a carriage return followed by a line feed (CRLF). Some implementations, such as Microsoft Excel, ignore leading spaces and zeros [7].
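A parser that wants to accept files from all of these implementations can treat a carriage return, a line feed, or the two-character sequence CR LF as one record terminator. A minimal sketch follows. The names LineEndingSketch and splitRecords are hypothetical, and the sketch assumes that no field contains a quoted line break, which a full parser would have to handle separately.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of splitting CSV text into records regardless of the line-ending convention. */
final class LineEndingSketch {

    static List<String> splitRecords(String text) {
        // "\r\n" is listed first so that CR LF counts as one terminator, not two
        return Arrays.asList(text.split("\r\n|\r|\n"));
    }

    public static void main(String[] args) {
        // the same four records, written with three different terminators
        System.out.println(splitRecords("a,b\r\nc,d\re,f\ng,h"));
    }
}
```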

Challenges Tackled:

The main challenges I have tackled in this project are:

Description and understanding of existing implementations

Analysis of the differences between implementations

Understanding of the basic CSV specification

Formal specification of CSV

Identification of the main format choices

Formal BNF and EBNF grammar

Implementation of a small parser in ANTLR for the basic CSV specification

Extended Activities:

The extended work that I am going to do in this project, compared to the initial plan, is:

CSV to XML translator

This activity was not in the initial plan, but I want to enhance my project by adding a CSV to XML translator. It was an optional requirement for my project.
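As a rough illustration of the translator, the sketch below turns records that have already been parsed into an XML string using the XML classes that ship with Java (javax.xml and org.w3c.dom). The class name CsvToXmlSketch and the element names records, record and field are my own assumptions, since the output schema is not fixed here, and reading input.csv and writing output.xml are left out.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch of a CSV to XML translator working on already parsed records. */
final class CsvToXmlSketch {

    static String toXml(List<List<String>> records) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("records"); // hypothetical element name
            doc.appendChild(root);
            for (List<String> record : records) {
                Element recordElement = doc.createElement("record");
                root.appendChild(recordElement);
                for (String field : record) {
                    Element fieldElement = doc.createElement("field");
                    fieldElement.setTextContent(field); // &, < and > are escaped on output
                    recordElement.appendChild(fieldElement);
                }
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("could not build the XML output", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(Arrays.asList(
                Arrays.asList("name", "note"),
                Arrays.asList("Ann", "b & c"))));
    }
}
```

Building the document through the DOM API rather than by string concatenation means that special characters in a field are escaped by the library instead of by hand.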

Bibliography

[1] Government Publication, "Text Qualified CSV Files", http://govpubs.lib.umn.edu/stat/csv.phtml

[2] Ohio State University, "What is CSV", http://8help.osu.edu/1701.html

[3] Repici, J., "HOW-TO: The Comma Separated Value (CSV) File Format", 2004, http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

[4] Edoceo, Inc., "CSV Standard File Format", 2004, http://www.edoceo.com/utilis/csv-file-format.php

[5] Rodger, R. and O. Shanaghy, "Documentation for Ricebridge CSV Manager", February 2005, http://www.ricebridge.com/products/csvman/reference.htm

[6] Shafranovich, Y., "Common Format and MIME Type for Comma-Separated Values (CSV) Files", RFC 4180, SolidMatrix Technologies, Inc., October 2005, http://tools.ietf.org/html/rfc4180#ref-4

[7] Carol C.H. Chou, "Action Plan Background: CSV", http://www.fcla.edu/digitalArchive/pdfs/action_plan_bgrounds/CSV_bg.pdf

[8] Wikipedia, "CSV Application Support", http://en.wikipedia.org/wiki/CSV_application_support

[9] Research Implementation, "CSV Reader.com", http://www.csvreader.com/csv_format.php

[10] Wikipedia, "Comma-Separated Values", http://en.wikipedia.org/wiki/Comma-separated_values

[11] Wikipedia, "Flat File Database", http://en.wikipedia.org/wiki/Flat_file_database

[12] Wikipedia, "BNF Grammar", http://en.wikipedia.org/wiki/BNF_grammar

[13] Wikipedia, "What is Parsing", http://en.wikipedia.org/wiki/Parsing

[14] "Free Compiler Construction Tools", http://www.thefreecountry.com/programming/compilerconstruction.shtml

[15] Microsoft Office, "Import and export text files", http://office.microsoft.com/en-us/excel-help/import-or-export-text-files-HP010099725.aspx

[16] Kasper B. Graversen, "Super CSV", 2006-2009, http://supercsv.sourceforge.net/index.html

[17] Open Office, "Calc/Features/Numbers import for plain text files", http://wiki.services.openoffice.org/wiki/Calc/Features/Numbers_import_for_plain_text_files#CSV_import_options_dialog

[18] A detailed document on BNF, EBNF and their differences, http://www.cs.umsl.edu/~janikow/cs4280/bnf.pdf

[19] Understanding BNF Notations, http://download.oracle.com/docs/cd/E12032_01/doc/epm.921/html_techref/maxl/dml/rules/bnf.htm

[20] Wikipedia, "Extended Backus-Naur Form", http://en.wikipedia.org/wiki/Extended_Backus-Naur_Form

[21] ANTLR.org, http://www.antlr.org/

[22] Wikipedia, "ANTLR", http://en.wikipedia.org/wiki/ANTLR

[23] "Quick Starter on Parser Grammars", http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required

[24] http://javadude.com/articles/antlrtut/

[25] "CSV File Reading and Writing", http://docs.python.org/library/csv.html

[26] "Parsing CSV in Erlang", http://ppolv.wordpress.com/2008/02/25/parsing-csv-in-erlang/
