Using Regex to return the first N words in a string

Jeff Perrin needed a function to return the first N words in a string (to create a short summary or snippet). He did it by parsing the string by hand, which tends to be more error prone and usually makes for less readable code. Fortunately, regular expressions handle this quite nicely. Here's a test that makes sure we get the first 4 words in a string, and the function "FindFirstWords" that does this very easily using a simple regular expression.

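What I'm doing here is using the expression to find the first 4 occurrences of text made up of one or more word characters (\w: letters, digits or underscores) followed by one or more whitespace characters (\s). The match should contain 4 captures inside it, one for each "word" that was found. Since the match's Value is exactly that run of words, trailing whitespace included, the function just returns it rather than iterating over the captures.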
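To make the captures described above concrete, here is a minimal sketch that reads each repetition of the group out of the match. The class name CaptureDemo and the helper FirstWords are hypothetical names used for illustration only; they are not part of the original code. The pattern has the same shape as the one in FindFirstWords.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper (not from the original post): returns each word
    // captured by the repeated group, with its trailing whitespace trimmed.
    public static string[] FirstWords(string input, int howMany)
    {
        // A word (\w+) followed by whitespace (\s+), repeated howMany times.
        string pattern = @"(\w+\s+){" + howMany + "}";
        Match match = Regex.Match(input, pattern);
        if (!match.Success)
            return new string[0];

        // Group 1 matched howMany times; .NET keeps every repetition in Captures.
        CaptureCollection captures = match.Groups[1].Captures;
        string[] words = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
            words[i] = captures[i].Value.TrimEnd();
        return words;
    }

    public static void Main()
    {
        // Prints "this", "is", "word" and "four", one per line.
        foreach (string word in FirstWords("this is word four five six", 4))
            Console.WriteLine(word);
    }
}
```

Reading the captures is only needed if you want the words individually; for a snippet, the match's Value already holds them in order.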

It's not fully tested, as you can see. I wrote only one test, to see that it works on this sort of sentence. More tests could and should be added to cover other cases. In fact, if this were real TDD, I would have started with a test of an empty string, then moved on to getting only one word, then two, and so on.

using NUnit.Framework;
using System.Text.RegularExpressions;

[Test]
public void TestRegexFindFirstNWords()
{
      const string INPUT =
            "this is word four five six seven eight nine ten eleven twelve thirteen!";
      const int NUM_WORDS_TO_RETURN = 4;

      string output = FindFirstWords(INPUT, NUM_WORDS_TO_RETURN);

      // The pattern keeps the whitespace after the last word, hence the trailing space.
      string expectedOutput = "this is word four ";
      Assert.AreEqual(expectedOutput, output);
}

private string FindFirstWords(string input, int howManyToFind)
{
      // thanks to Jeff Atwood for making this code even simpler!
      // A word (\w+) followed by whitespace (\s+), repeated howManyToFind times.
      string REGEX = @"([\w]+\s+){" + howManyToFind + "}";
      return Regex.Match(input, REGEX).Value;
}
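The extra cases mentioned earlier (an empty string, one word, two words) could be sketched roughly like this. To keep the sketch runnable without NUnit, it uses plain console checks; the class FirstWordsChecks and the Check helper are hypothetical stand-ins, and FindFirstWords is copied here as a static method. The last check is an assumption about what is worth pinning down: it records what the pattern as written does when there are too few words, not necessarily what you would want it to do.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstWordsChecks
{
    // Copy of the post's function, made static so the checks can call it directly.
    public static string FindFirstWords(string input, int howManyToFind)
    {
        string pattern = @"([\w]+\s+){" + howManyToFind + "}";
        return Regex.Match(input, pattern).Value;
    }

    // Hypothetical helper standing in for Assert.AreEqual.
    private static void Check(string expected, string actual, string label)
    {
        if (expected != actual)
            throw new Exception(label + ": expected \"" + expected + "\" but got \"" + actual + "\"");
    }

    public static void Main()
    {
        // A failed match has an empty Value, so an empty input gives an empty result.
        Check("", FindFirstWords("", 4), "empty string");
        Check("one ", FindFirstWords("one two three", 1), "one word");
        Check("one two ", FindFirstWords("one two three", 2), "two words");
        // Every word must be followed by whitespace, so asking for more
        // words than that yields no match at all.
        Check("", FindFirstWords("only two", 4), "too few words");
        Console.WriteLine("all checks passed");
    }
}
```

The "too few words" case shows one limitation of the simple pattern: a string with fewer than N whitespace-terminated words returns an empty result rather than the words it does have.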
