codeslower.com Savor Your Code.

Exploring Oslo: Using Mg to Define a To-Do Language

Oslo, finally unveiled at this week's PDC is a "tool, a platform and a repository." Two new languages have been shipped so far -- "M" and "Mg" (or MGrammar). This article focuses on Mg.

Mg is a language for defining languages. You define the syntax for your language and it generates various artifacts for you. This article explores using Mg to define a "language" for writing to-do lists. Three versions are developed, each building on the last. Problems with each version are demonstrated and their solution is given in the next. A "challenge" problem is also given for each version. Readers are welcome to submit their solutions and they will be linked here. Links can also be provided in the comments.

As always, the source for the samples are available on GitHub:

git://github.com/m4dc4p/mg-todo.git

http://github.com/m4dc4p/mg-todo/tree/v10312008

You can pull the tag v103102008 to get the source code as it existed when this article was published. The hyperlink above points to the correct version of the repository.

Resources

  • The Oslo Developer Center is a must. It hosts the Oslo SDK, as well as many other resources.
  • The SDK's installation directory has a number of articles in its "Documentation" folder. Most relevant are:
    • Mg in a Nutshell
    • Mg Language Specification
  • Under "Samples" are two valueable files:
    • Mg.mg -- The Mg language, specified in Mg. Very useful for looking up how complex things can be done
    • M.mg -- The M language, specified in Mg. Again, useful for reference.
  • M coding conventions, though not specifically for Mg, has a lot of good advice.

Intellipad

Intellipad is the new editor shipped with Oslo, which has nice support Mg. It's still a bit buggy so follow these steps to open an Mg file:

  • From All Programs, launch Intellipad (Samples Enabled).
  • Hit CTRL+SHIFT+T to launch "Tree Preview Mode".
  • Select your .mg file.

You'll be presented with three panes along the top and one along the bottom. The middle pane is where your Mg file is opened. Any text entered in the left-hand pane will be parsed as specified by the Mg file. The right-hand pane shows "productions" from that parse and will not be treated here. The bottom pane shows any parsing errors.

I haven't determined a way to open a file in the left-hand pane and get it parsed according to the Mg specification. You'll need to copy your text from the file you want parsed and paste it in to the pane.

If you want to open another .mg file, you have to close Intellipad and do the steps above again or you'll get errors.

Update (November 7, 2008)

Quetzal Bradley of the Intellipad team sent in this note:

Regarding the difficulty of controlling the panes in MGrammar mode, we have some ideas to make this better in the next CTP. Meanwhile, Sara Wong wrote a blog post on creating a custom Intellipad command to set up the grammar, input, and treeview panes without requiring a file to start with. This same approach could be slightly modified to ask for a file to be the input instead of binding to the buffer the command is invoked from.

A To-Do List Language

We're going to define a simple language for representing tasks. Each task will be on one line in our file and will have a title and a due date:

task "Define a language for todo lists." 10-31-2008
task 'Make sure we can parse "complex" titles with quotes" 10-31-2008
task "Finally clean out the garage."

Tasks v1

Our first attempt allows us to define multiple tasks of the form:

task <task name> <due date>

We'll deconstruct the elements of the definition one by one. First, every language must be inside a module and language:

module ToDo
{
    language Tasks1 {
    ...
    };
};

Inside the language block we get to define our language. A language is defined as a series of rules, where each rule can match some element of the text being processed. Let's start by defining what a task should look like:

   syntax TaskLine = "task" Space+ Title Space+ DueDate EndTask;

This says a task line starts with the word task, followed by one or more space, followed by a title, followed by at least one space, a due date, and finally an end of task indicator. Let's fill out those definitions:

syntax Title = "\"" any* "\"";
syntax DueDate = Month "-" Day "-" Year;

A Title is a quote, followed by any number of characters, followed by a closing quote. The any keyword matches any character at all. This will present problems later but we leave it for now. A DueDate is a month, followed by a dash, followed by a Day, another dash, and finally a year.

syntax EndTask = '\r' | '\r''\n' | '\n';
syntax Space =  ' ';

EndTask defines the end-of-task terminator as a carriage return, carriage return and linefeed, or a linefeed alone. You can see a familiar C-like syntax for escaping CR and LF characters. Unicode escape sequences are also possible but not shown here. Space defines just what it says -- a space.

token Month = Digit#1..2;
token Day = Digit#1..2;
token Year = Digit#2..4;

Month, Day, and Year are used to make up a date and are all some number of digits. The token keyword is also introduced. It allows the #n..m expressions after Digit, which cannot be used in rules defined with the syntax keyword. There are other ramifications but they are not treated yet.

token Digit = "0".."9";

Finally, we define a digit as being a character from the range 0 -- 9. The .. allows ranges to be specified.

Taken as a whole, our definition is (which can be found in todo1.m and todo1.todo):

  module ToDo
  {
      language Tasks1 {
              syntax Main = TaskLine+;

              syntax TaskLine = "task" Space+ Title Space+ DueDate EndTask;
              syntax Title = "\"" any* "\"";
              syntax DueDate = Month "-" Day "-" Year;

              syntax EndTask = '\r' | '\r''\n' | '\n';
              syntax Space =  ' ';

              token Month = Digit#1..2;
              token Day = Digit#1..2;
              token Year = Digit#2..4;
              token Digit = "0".."9";
      }             
  }

Notice the Main rule. Each language must have one and only one rule named Main, which serves as the "root" definition for the language. Our definition specifies that our language is made up of one or more (the + operator) TaskLine elements.

Challenge -- Define other task types

The above uses the task keyword to introduce a task. Extend the language to allow different keywords. For example, todo, phone call or appt. For even more fun, allow different types of information to be on the line, depending on the type of task.

Reader Solutions

Douglas Finke sent in this snippet:

 syntax TaskLine = TaskType Title DueDate?;
 syntax TaskType = 'task' | 'appt' | 'phone call';

Check out his blog, linked above. He has some good posts exploring Intellipad.

What Doesn't Work

Now, enter a task definition like below:

task "Define a second task." 08-29-2008

You'll see that the parser complains about the "task" in the title. We'll fix that in the next version.

Tasks v2

The problem with the language above lies in this rule:

    syntax Title = "\"" any* "\"";

By using the any keyword, we are allowing any character at all, including double quotes. That makes determining where our task title ends unreliably. Mg determines that the "task" appearing in the task title could be the start of a new task, and complains.

To fix this problem, we need to be less liberal about what characters we allow in a title. We change our definition of Title to:

    token Title = '"' ^('"' | '\n' | '\r')* '"';

This definition says a title is a double quote, followed by sequence of any number of characters, which can be anything except a double quote, carriage return or a newline, followed by a closing double quote. The inverse operator ("^") matches any characters NOT specified. Our definition then excludes double quotes and newlines. We use the token keyword because inverse cannot be used with the syntax keyword.

Our last language was very explicit about where spaces and newlines can be found, but there is an easier way of doing it. The interleave keyword allows the characters specified to be appear between ANY syntax defined terms. This can have subtle effects but overall it seems worth it. We add this as:

    interleave Whitespace = '\r' | ' ' | '\n';

The name given ("Whitespace") doesn't seem to matter. interleave rules are not applied to patterns defined with token keyword. When explicit control over matching is needed, use token. In any case, TaskLine now has a very compact definition. No mention of spacing is needed:

    syntax TaskLine = "task" Title DueDate;

Our new language definition is then:

module ToDo
{
    language Tasks2 {
             syntax Main = TaskLine+;

             syntax TaskLine = "task" Title DueDate;
             syntax DueDate = Month "-" Day "-" Year;
             token Title = '"' ^('\n' | '\r')* '"'; 

             token Digit = "0".."9";
             token Month = Digit#1..2;
             token Day = Digit#1..2;
             token Year = Digit#2..4;

            interleave Whitespace = '\r' | ' ' | '\n';
    }
}

Challenge -- Excluding dangerous text

Extend the title definition above to exclude common HTML elements such as anchors or <OBJECT> tags. Take a whitelist or blacklist approach.

What Doesn't Work

The most obvious problem with our current definition is that we can't include quores in a task title:

    task "Create a task with "quotes"" 10-31-2008

An empty task file also presents problems. We'll fix them both in the next language.

Tasks v3

One way to handle quoted text is to change the character which represents quotes. That is, if we want to include double quotes, the title should be surrounded by single quotes. Similarly, if we want single quotes, the title should be surrounded by double quotes.

We first define Title to recognize either form:

    syntax Title = SingleQuotedText | DoubleQuotedText;

And then we could define SingleQuotedText and DoubleQuotedText as:

    token SingleQuotedText = "'" ^("'" | '\n' | '\r')* "'";
    token DoubleQuotedText = '"' ^('"' | '\n' | '\r')* '"';

These rules are almost identical, though. In each case the "quote" character is excluded from the text in the middle. Clearly those at Microsoft were anticipating this situation, as Mg allows us to define a parameterized rule:

    token QuotedText(Q) = Q (^('\n' | '\r') - Q)* Q;

Q represents our quote character. We still use the inversion operator to exclude newlines from task titles. The difference operator ("-") then subtracts the quote character from the list of allowable characters. This allows use to define SingleQuoteText and DoubleQuotedText as:

    token SingleQuotedText = QuotedText('"');
    token DoubleQuotedText = QuotedText("'");

where the particular quote excluded is passed as a parameter.

While we are here, we can also make our task definition a little less strict. Let's allow an empty file:

    syntax Main = TaskLine*;

The * operator means zero or more TaskLine elements can appear in our language. We can also make due dates optional:

    syntax TaskLine = "task" Title DueDate?;

By appending the ? operator to DueDate, we can omit the due date. We cannot have more than one due date, though, so our language still makes sense.

Our complete language specification now looks like (available in todo3.m and todo3.todo):

    module ToDo
    {
            language Tasks3 {
                     syntax Main = TaskLine*;
                     syntax TaskLine = "task" Title DueDate?;

                     syntax Title = SingleQuotedText | DoubleQuotedText;
                     syntax DueDate = Month "-" Day "-" Year;

                     token Digit = "0".."9";
                     token Month = Digit#1..2;
                     token Day = Digit#1..2;
                     token Year = Digit#2..4;

                     token SingleQuotedText = QuotedText('"');
                     token DoubleQuotedText = QuotedText("'");
                     token QuotedText(Q) = Q (^('\n' | '\r') - Q)* Q;

                     interleave Whitespace = '\r' | ' ' | '\n';
            }
    }

Challenge -- Task Descriptions

Add the ability to specify a multi-line description for a task in addition to the title. Make sure almost any character can appear in the description and try not to make it to burdensome on the user. Do not make descriptions required. That is, all previous task files should still parse.

Conclusion

Mg is clearly a powerful technology for defining languages. The above is a trivial use, but already we have bypassed most simple file parsing strategies. Quoted strings and balanced parentheses are the bane of non-parsing techniques and it appears the Mg will be placing those abilities within reach of most .NET developers.

As always, patches and feedback are welcome. You will also notice a few extra files in the source repository -- I'm glad to answer questions about any in the comments or via email.

Remember, if you do complete the challenges, please let me know!

Category: None

Please login to comment.

2 Comments

  1. Line Return Concern by Jon von Gillern (2008-11-07)

    Thanks for taking the time to post this, Mg sounds like it is going to be very cool.

    Creating a "syntax" element is really kind of neat because it looks like they're making Regular Expressions a little more verbose, but more readable to the average developer. I say this because the characters ^ | * + seem to work in the same way as Regex.

    If the team is in fact turning they're syntax notation into a regular expression you'll probably want to be careful when declaring the EndTask element and change it from

    'r' | 'r''n' | 'n'

    to

    'r''n' | 'r' | 'n'

    instead. This is because the regular expression engine is lazy and if it fed the string "rn" it would return two separate matches, one with the match "r" and another match "n" which would essentially be two "EndTask" elements. But if you use the later syntax declaration for EndTask, it will only have one match for "rn".

    Now the specifics of this example might not make a difference if there are extra "EndTask"'s but I could definitely see the possibility of getting some bizarre behaving code that doesn't have an obvious cause.

  2. Line Return Concern by Jon von Gillern (2008-11-07)

    It looks like the comments are getting stripped for front slashes, you should be able to figure it out :)