Re: "ocaml_beginners"::[] processing a file with ocamllex

citromatik wrote:
> Hi all,
>
> I'm trying to parse a plain text file containing multiple records separated
> by a "//". A record sample can be viewed here:
> http://metagenomics.uv.es/gbexample.txt
>
> Extracting simple fields like the "ACCESSION" number is quite simple:
>
> let LLeter = ['A'-'Z']
> let digit = ['0'-'9']
> let space = ' '
> let ACCESSION = "ACCESSION" space+ LLeter LLeter digit+
> rule gb = parse
> | ACCESSION { ...; gb lexbuf }
> | _ { gb lexbuf }
>
> ...But, what about those multiline records? how can I extract them?
> I've tried using '#'. For example, for obtaining a full "REFERENCE":
>
> let endline = '\n'
> let KWD = endline LLeter+
> let REFERENCE = "REFERENCE" _+
> rule gb = parse
> | ACCESSION { ...; gb lexbuf }
> | REFERENCE#KWD { print_endline (Lexing.lexeme lexbuf); gb lexbuf }
> (* Line 16 *)
> | _ { gb lexbuf }
>
> But this gives me an error when trying to run ocamllex on it:
>
> File "genbank.mll", line 16, character 67: character set expected.
>
> What is this "character set expected" error?

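a#b means "any char from a that does not belong to b", so both a and b must
be character sets.

Your KWD does not represent a set of chars: it matches a sequence (a newline
followed by one or more letters). The same is true of your REFERENCE.

You can write:

let az = ['a'-'z']
let x = az # ['d'-'h']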
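
To make the rule concrete, here is a small ocamllex sketch in which every operand of # is a character set. The file name, the names upper, not_nl and other, and the next_keyword entry point are all made up for illustration; none of them comes from the original genbank.mll.

```ocaml
(* charset_diff.mll: hypothetical file name, for illustration only *)

let upper  = ['A'-'Z']
let not_nl = _ # '\n'          (* any character except a newline *)
let other  = not_nl # upper    (* neither a newline nor an uppercase letter *)

(* Return the run of uppercase letters that starts the next line
   beginning with one, or None at end of input. *)
rule next_keyword = parse
  | (upper+ as kwd) not_nl* '\n'?   { Some kwd }
  | other not_nl* '\n'?             { next_keyword lexbuf }
  | '\n'                            { next_keyword lexbuf }
  | eof                             { None }
```

Both _ and a single character literal count as character sets, and so does the result of a previous #, which is why not_nl can be used on the left of a second difference. Something like REFERENCE # KWD cannot work because both sides match strings longer than one character.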


> Is there a better (well, good) way to parse the multiline fields?

Maybe someone already wrote an OCaml parser for Genbank (I know I don't have a
complete one, if any; you may ask on caml-list).


Solution 1: don't use ocamllex at all

Process the file line by line (input_line is fine).
Write yourself a fast test_string_prefix function; "fast" here just means
comparing characters in place instead of allocating a substring with String.sub.

Then write a pure OCaml parser whose input is the stream of lines. This is not
very different from solution 2 below, which I would choose.
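
A minimal sketch of solution 1 in plain OCaml follows. It assumes that a record ends at a line that is exactly "//" and that a continuation line starts with a space. The names has_prefix, read_lines, split_records and field are invented for this example, and real GenBank files may need more care (trailing '\r', nested sub-keywords, and so on).

```ocaml
(* [has_prefix ~prefix s] is true when [s] starts with [prefix].
   Characters are compared in place, so no substring is allocated. *)
let has_prefix ~prefix s =
  let n = String.length prefix in
  String.length s >= n &&
  (let rec loop i = i >= n || (s.[i] = prefix.[i] && loop (i + 1)) in
   loop 0)

(* Read all lines of a channel with input_line. *)
let read_lines ic =
  let rec loop acc =
    match input_line ic with
    | line -> loop (line :: acc)
    | exception End_of_file -> List.rev acc
  in
  loop []

(* Group lines into records; a line equal to "//" closes the current record. *)
let split_records lines =
  let flush current records =
    if current = [] then records else List.rev current :: records in
  let rec go current records = function
    | [] -> List.rev (flush current records)
    | "//" :: rest -> go [] (flush current records) rest
    | line :: rest -> go (line :: current) records rest
  in
  go [] [] lines

(* Return the text of the first field starting with [keyword], joining
   its continuation lines (lines starting with a space) with spaces. *)
let field keyword record =
  let rec continuation acc = function
    | line :: rest when has_prefix ~prefix:" " line ->
        continuation (String.trim line :: acc) rest
    | _ -> String.concat " " (List.rev acc)
  in
  let rec find = function
    | [] -> None
    | line :: rest when has_prefix ~prefix:keyword line ->
        let n = String.length keyword in
        let first = String.trim (String.sub line n (String.length line - n)) in
        Some (continuation [first] rest)
    | _ :: rest -> find rest
  in
  find record
```

With these pieces, something like `List.map (field "REFERENCE") (split_records (read_lines ic))` would give the first REFERENCE block of each record, including its indented sub-lines.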


Solution 2: use only ocamllex

Here is the structure of a reasonable parser:

{
  type record = {
    mutable locus : ...;
    mutable definition : string option;
    mutable accession : ...;
    ...
  }

  let new_record () = {
    locus = None;
    definition = None;
    accession = None;
    ...
  }

  let newline lexbuf =
    ...
    (* would set the correct line count
       for useful error messages *)
}

rule top record = parse
    "LOCUS " ...
      { ... }
  | "DEFINITION " ([^'\r' '\n']+ as text) '\r'? '\n'
      {
        newline lexbuf;
        let def_text = continue_definition [text] lexbuf in
        if record.definition <> None then
          ... (* error: multiple DEFINITION fields *);
        record.definition <- Some def_text;
        top record lexbuf (* keep scanning the rest of this record *)
      }
  | "ACCESSION " ...
      { ... }
  ...
  | "//" '\r'? '\n'
      {
        newline lexbuf;
        Some record
      }
  | eof
      {
        ... ; (* check that the current record is empty *)
        None
      }
  | ""
      { (* report error *) }


and continue_definition accu = parse
    " " ([^'\r' '\n']+ as text) '\r'? '\n'
      {
        newline lexbuf;
        continue_definition (text :: accu) lexbuf
      }
  | ""
      { String.concat " " (List.rev accu) }


{
  let rec scan process_record lexbuf =
    match top (new_record ()) lexbuf with
        None -> ()
      | Some x ->
          process_record x;
          scan process_record lexbuf
}
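
The body of newline is left open in the skeleton. One possible way to fill it in, assuming OCaml 3.11 or later, is to delegate to Lexing.new_line from the standard library, which bumps the line counter stored in the lexbuf. The error_location helper is a hypothetical name, added only to show how the tracked position could end up in an error message.

```ocaml
(* Advance the line counter kept in the lexbuf. Lexing.new_line updates
   lex_curr_p, so positions read afterwards report the new line number. *)
let newline lexbuf = Lexing.new_line lexbuf

(* Hypothetical helper: describe where the current lexeme starts. *)
let error_location lexbuf =
  let pos = Lexing.lexeme_start_p lexbuf in
  Printf.sprintf "line %d, column %d"
    pos.Lexing.pos_lnum
    (pos.Lexing.pos_cnum - pos.Lexing.pos_bol)
```

These two definitions would go in the header section of the .mll file, and the catch-all "" case could then fail with a message built from error_location lexbuf.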




Martin

--
http://mjambon.com/



Thu Mar 12, 2009 3:28 pm



Later replies in this thread (truncated previews):

avlondono, Mar 12, 2009 10:35 pm:
> ... I ended up using option number 1 from Martin. When I wrote that parser it
> was my first real experience with parsers, and was having many problems parsing...

citromatik, Mar 13, 2009 1:06 pm:
> ... Hi Martin, Thanks a lot for your answer, I end up coding something similar,
> but delegating to the parser the addition of multi-line fields. Something like:...