Skip to content

Support string literals in pattern matching #16

@dvanhorn

Description

@dvanhorn

The current implementation of pattern matching in Knock is incorrect for string literal patterns because the match boils down to an eq? test. All other literal patterns are immediates, but strings are not.

So for example (match "foo" ["foo" #t] [_ #f]) will produce #f in the current implementation. String literal interning would fix this particular problem, but won't fix the problem in general, as (match (make-string #\f 3) ["fff" #t] [_ #f]) would continue to produce #f.

A fix was discussed on Piazza:


Today in class we spent some time talking about what it would take to support string literal patterns correctly. The main issue is that unlike other kinds of literals we've considered so far, string literals are memory allocated (either statically or dynamically) and what it means to match is to be a string that is "equal" in the string=? sense to the string in the pattern; other literals are immediates, and thus can be compared with a single Cmp instruction.

I thought about it more on my bike ride home this evening and this is what I came up with.

I added a special case for matching the empty string because (assuming the empty string is represented by the type tag alone), it is actually an immediate:

;; Pat CEnv Symbol -> (list Asm CEnv)
(define (compile-pattern p cm next)
  (match p
    #;...
    [(Lit "")
     (list
      (let ((ok (gensym)))
        (seq (Cmp rax type-str)
             (Je ok)
             (Add rsp (* 8 (length cm)))
             (Jmp next)
             (Label ok)))
      cm)]))

So the next case applies only to non-empty string patterns. It statically allocates memory for the string literal of the pattern in a data section. The code then checks whether the value in rax is:

  • a string
  • not the empty string
  • of the same length as the pattern string

If all of these are true, then it's time to compare the contents of the two strings for equality. To do this, I call the C standard library function:

int memcmp(const void *str1, const void *str2, size_t n)

This function is given two pointers and a size (in bytes) and returns 0 if the pointers point to n bytes of equal data.

Here's the code:

;; Pat CEnv Symbol -> (list Asm CEnv)
(define (compile-pattern p cm next)
  (match p
    #;...
    [(Lit (? string? s))
     (list      
      (let ((string (gensym))
            (ok     (gensym))
            (bail   (gensym)))
        (seq (Data)
             (Label string)
             (map Dd (map char->integer (string->list s)))
             (if (odd? (string-length s)) (seq (Dd 0)) (seq))
             (Text)
             (Mov r8 rax)
             (And r8 ptr-mask)
             (Cmp r8 type-str)
             (Jne bail) ; value is not a string
             (Cmp rax type-str)
             (Je bail) ; value is the empty string (but pattern is not)
             (Xor rax type-str)
             (Mov r8 (Offset rax 0))
             (Mov 'rdx (string-length s))
             (Cmp r8 'rdx)
             (Jne bail) ; value is not an equal length string
             ;; time to compare the contents
             ;; let's use the C standard library
             (Extern 'memcmp)
             (Sar 'rdx 2) ; 4 bytes per character
             (Mov 'rdi rax)
             (Add 'rdi 8)
             (Lea 'rsi string)
             pad-stack
             (Call 'memcmp)
             unpad-stack
             (Cmp rax 0)
             (Je ok)                  
             (Label bail)
             (Add rsp (* 8 (length cm)))
             (Jmp next)
             (Label ok)))
      cm)]))

Some examples:

> (require "run.rkt" "parse.rkt")
> (run (compile (parse '(match "a"
                          ["a" 1]
                          [_ 2]))))
1
> (run (compile (parse '(match ""
                          ["" 1]
                          [_ 2]))))
1
> (run (compile (parse '(match "a"
                          ["" 1]
                          [_ 2]))))
2
> (run (compile (parse '(match ""
                          ["a" 1]
                          [_ 2]))))
2

And there you go.

One thing I think this code suggests is that making the empty string equal to type-str was a wrong move and better is to statically allocate a canonical empty string so that this code can be simplified and unified into a single case.


This should be incorporated into the Knock compiler, but after changing empty strings (and vectors) to be statically allocated so that this code can be simpler and in a single case. Similar code would be in a string=? primitive, so it probably make sense to define a function and call it from here and string=?.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions