
Line-at-a-Time Interactive Input March 12, 2008

Posted by PythonGuy in Advanced Python, Python, Text Handling.

Let’s write a line-at-a-time interactive console.

Here’s how I would write it without knowing much about Python:

#!/usr/bin/env python

import sys

for line in sys.stdin:
  print "You wrote:", line
print "Goodbye!"

Except that doesn’t work. See, the input on stdin is buffered. That means no lines are generated until a certain amount of input has accumulated. Looking around, there is no way I can see to turn off the buffering. Ugh.

Now, don’t get me wrong. The above method is by far the best way to process a file line-at-a-time, just not interactive input. I would use it when piping the output of another process into this Python process, or when loading a file.

But for reading a terminal (or for that matter, reading sockets) it’s not the best way to go, thanks to buffering.
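As an aside, one way around the read-ahead buffer is to call readline() directly, wrapped in the two-argument form of iter(). A minimal sketch, in Python 3 syntax (Python 3’s stdin iterator is already line-buffered, so there this matters mostly for other buffered streams):

```python
import io

def lines(stream):
    # iter() with a sentinel calls stream.readline() until it returns '',
    # yielding each line as soon as readline() does -- no read-ahead buffer.
    return iter(stream.readline, '')

# Works on sys.stdin or any file-like object:
for line in lines(io.StringIO("first\nsecond\n")):
    print("You wrote:", line, end="")
```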

Well, how does one do this the Python way? I saw some notes on raw_input(), but I didn’t want to go there. (I end up there anyway. You’ll see.)

What about Logix? Logix is a language written in Python, and they have an interactive console just like Python’s. How did they do it?

Looking in their code, they simply subclassed code.InteractiveConsole and overrode some features. In particular, they messed with sys.ps1 and sys.ps2 (being careful to remember what they were and to restore them when finished). But as far as the InteractiveConsole is concerned, they pretty much replaced the runsource method with their own.

So, the line-at-a-time input program, at a minimum, would look like this:

#!/usr/bin/env python

import code

class InteractiveConsole(code.InteractiveConsole):
  def runsource(self, source, filename='<input>', symbol='single'):
    print "You wrote:", source
    return False # return True if you want more input

InteractiveConsole().interact()
print "Goodbye!"

This is much better. If you are writing a language interpreter for a new language you are writing in Python, then this is a great place to start. It really is a natural fit.
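Fleshing that out a little, here is a sketch (in Python 3 syntax, with hypothetical names EchoConsole and with_prompts) that also swaps sys.ps1 and sys.ps2 the way Logix does, restoring them afterward:

```python
import code
import sys

class EchoConsole(code.InteractiveConsole):
    # Echo each line instead of compiling and executing it as Python.
    def runsource(self, source, filename='<input>', symbol='single'):
        print("You wrote:", source)
        return False  # returning True makes interact() prompt for more

def with_prompts(ps1, ps2, run):
    # Set the prompts around run(), restoring the old values afterward
    # so the surrounding interpreter is left untouched.
    old = getattr(sys, 'ps1', None), getattr(sys, 'ps2', None)
    sys.ps1, sys.ps2 = ps1, ps2
    try:
        run()
    finally:
        sys.ps1, sys.ps2 = old

# To run it interactively:
# with_prompts('echo> ', '....> ', EchoConsole().interact)
```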

But, what if you want something simpler? Well, raw_input() is really the way to go.

#!/usr/bin/env python

while True:
  try:
    line = raw_input()
  except EOFError:
    break
  print "You wrote:", line
print "Goodbye!"

This is about the most basic form you can have.

I experimented with converting the above to a generator so you can write for line in input(), but I realized that it’s not really practical. Sometimes you may want to change the prompt, or do something else between lines. So that’s not the best form for line-at-a-time interactive work.
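For reference, the generator version is easy enough to write; the limitation is that the prompt is fixed for the whole loop. A sketch in Python 3, where raw_input() became input() (the read parameter is a hypothetical hook, handy for testing):

```python
def input_lines(prompt='', read=input):
    # Yield lines until EOF. Note the prompt cannot change mid-loop,
    # which is exactly the limitation described above.
    while True:
        try:
            yield read(prompt)
        except EOFError:
            return

# for line in input_lines('> '):
#     ...  # runs until the user sends EOF (Ctrl-D)
```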

One more note: If you want the full, interactive console with history, you’ll have to import readline. This is for both the raw_input method and the code.InteractiveConsole method. Note that if you are already in the interactive console for Python, readline may already be imported.
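In code, that is a one-liner, guarded for platforms that lack the module:

```python
try:
    import readline  # side effect: input()/raw_input() gain editing and history
except ImportError:
    readline = None  # e.g. some Windows builds; input still works, just plainly
```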


Defining a Grammar January 18, 2006

Posted by PythonGuy in Text Handling.

Let’s cover the basics of defining a grammar.

To define a grammar, you have to know the following:

  • Identify the terminal symbols. That is, which symbols represent actual text in the language.
  • Identify the non-terminal symbols. These will refer to other (terminal or non-terminal) symbols.
  • Identify the starting symbol. This represents the entire language: anything that can be derived from it is a valid expression in the language. It is virtually always a non-terminal symbol.
  • Identify a set of rules.

The last part is somewhat difficult. Each rule takes an ordered list of one or more symbols, at least one of them a non-terminal, on the left, and converts it into a list of zero or more terminal or non-terminal symbols on the right. Each rule is optional; that is, just because it can be applied doesn’t mean it has to be. There must be at least one rule with the start symbol alone on the left-hand side; otherwise, the language isn’t defined.
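For example, take a start symbol S and two hypothetical rules, S → "a" S "b" and S → "a" "b". Applying the rules repeatedly rewrites the start symbol into terminals:

```
S  →  a S b  →  a a S b b  →  a a a b b b
```

Every string this grammar generates has the same number of a’s as b’s, with all the a’s first.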

You can express any grammar with these rules. It is pretty complete. But how do you write out these rules?

A common form is EBNF (Extended Backus-Naur Form). It is pretty expressive and yet easy to read. You can read more about it, including pointers to online references, in the EBNF article at Wikipedia.
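For instance, a small (hypothetical) EBNF grammar for optionally signed integers might read:

```
digit   = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
integer = [ "-" ], digit, { digit } ;
```

Here digit and integer are non-terminals, the quoted characters are terminals, [ ] marks an optional part, and { } marks repetition.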

Now, if you write out your grammar in EBNF, you can quickly determine what type of grammar it is.

If you can write it out so that only a single non-terminal symbol appears on the left-hand side of every rule, then you have a type-2 (context-free) grammar, which a computer can parse.
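A classic example is arithmetic with parentheses. Each rule below has a single non-terminal on the left, so the grammar is context-free:

```
expr  = term, { "+", term } ;
term  = digit | "(", expr, ")" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
```

The recursion from term back into expr in the middle of a rule is what pushes this beyond type-3.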

If you can write it out so that non-terminal symbols appear only as the last item on the right-hand side, then you have a type-3 grammar (regular).

If you can reduce it to a single choice among a fixed set of terminal strings, then you have a type-4 grammar.

Try your best to reduce it as much as possible. The more you can reduce it, the easier your life will be.

If you end up with a type-3 grammar, you’re going to want to rewrite it as a regular expression. I’ll cover RE’s in a later post.

Text Processing January 18, 2006

Posted by PythonGuy in Text Handling.

Text processing is a problem you almost always run into in programming. What they don’t tell you is that there is a variety of useful tools for it.

I don’t admire Noam Chomsky’s politics, but he pioneered the field of understanding language and grammar from a computational viewpoint. The basic theory is that a language is actually a set of letters and a set of rules describing which sequences of those letters are valid. This set of rules is called the grammar. Leaving semantics aside for now, let’s focus on the problem of determining whether an expression is part of a language or not.

Chomsky names 4 levels of grammar that cover the types of languages you will encounter. Starting at the most expressive are type-0 (unrestricted) languages, which are so general that natural languages (English, French, etc.) and a number of theoretical constructions are about the only things that demand them. Type-1 languages are called context-sensitive because how a rule applies depends on the symbols around it. Needless to say, these languages, like type-0, are very difficult to parse correctly. Type-2 languages are called context-free because no rule depends on context. You can get very, very far with a type-2 language: almost every programming language, and a significant portion of most natural languages, can be described with a type-2 grammar. Type-3 languages are called regular, and are special because recursion in a rule can occur only at the end. You might have heard of regular expressions; these are (or at least used to be) a shorthand for describing type-3 languages.

Some people talk about an additional type, type-4. Chomsky didn’t define this type, but it has a lot of use. It is a language whose grammar simply says, “Pick one of these options.” This is severely limited, but in text processing we sometimes encounter languages that adhere to these rules. You can also think of an additional class of languages where there is only one option to choose from. Naturally, these are very restrictive and have very little use.

Each lower-numbered type incorporates all the abilities of the higher-numbered types and then some. So, for example, every type-3 language can also be expressed as a type-2 language.

Now, hopefully you aren’t too thoroughly confused. Let’s compare what we know to how to tackle problems we face in real life.

Let’s suppose you are writing a program that is looking for specific strings. This is very much like a type-4 language. But in real life, you may need to search within a string for that specific string. So, you’ll be using a type-3 language. The grammar will read something like, “Anything followed by what I am looking for followed by anything.”
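“Anything, then the thing, then anything” translates directly into a regular expression. A sketch in Python 3 (re.search supplies the implicit “anything” on both sides):

```python
import re

text = "a haystack with a needle in it"

# search() scans the whole string for the pattern, so the surrounding
# "anything" comes for free; match() would anchor at the start instead.
found = re.search(r"needle", text)
print(found is not None)  # prints True
```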

Let’s suppose that you need to do something more complicated, like identifying a string as a number that can have a complicated form. You should try to work out the rules for this so that they conform to a type-3 grammar. If you run into a wall (normally signalled by nested expressions, such as “(5+(3*6+4))”), then you will have to use a type-2 grammar.
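Sketching the number case in Python 3 (the exact rules here are hypothetical; adjust the pattern to whatever forms your numbers actually take):

```python
import re

# Optional sign, digits, optional fraction, optional exponent -- all
# expressible as a type-3 (regular) grammar.
NUMBER = re.compile(r"[-+]?\d+(\.\d+)?([eE][-+]?\d+)?$")

print(bool(NUMBER.match("-3.25e10")))     # a number: prints True
print(bool(NUMBER.match("(5+(3*6+4))")))  # nesting needs type-2: prints False
```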

If you know you need to go beyond regular grammars, then you are in type-2 land. This is a very complicated realm, one where significant thought and experimentation needs to be done. It’s not totally impassable, but it is difficult. So you must don a thinking cap and get into research mode before proceeding.

If you need a context-sensitive grammar, then you are, for practical purposes, beyond the ability of a computer. It’s best at that point to rethink the problem, or to understand and note its limitations.

Now, a word on semantics. Semantics is simply attaching a meaning to the grammar. For instance, in English, one style of a sentence is Subject-Verb. This is good if you want to identify something as English, or generate an English sentence. But it isn’t the complete picture. What does a sentence mean? Well, here you have the problem of semantics. You must capture what part of the text was the subject, and which part was the verb, and then identify the meanings of both, and take appropriate action.

This semantic component sounds difficult, but in practice it really isn’t. Python’s grammar, for instance, is interpreted into a parse tree of basic commands. Each command takes its child trees as arguments. Together, they get evaluated depending on what each node is and what its children are. It’s really quite simple when you see it in action.
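A toy version of the idea fits in a few lines (the node shapes here are hypothetical; Python’s real parse trees are much richer):

```python
# Each node is either a plain number or a tuple (operator, left, right).
def evaluate(node):
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)  # evaluate the child trees first
    if op == '+':
        return a + b
    if op == '*':
        return a * b
    raise ValueError('unknown operator: %r' % (op,))

# "(5+(3*6+4))" as a tree:
tree = ('+', 5, ('+', ('*', 3, 6), 4))
print(evaluate(tree))  # prints 27
```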

To conclude, I’ve covered that the problem of handling text can be very hard. But I’ve also shown you the general boundaries and the lay of the land. Your program is going to be able to either handle type-2 (with a lot of work on your part), type-3, or type-4 grammars. And applying semantics isn’t as difficult as you might expect.