Move equivalence testing to bpe-openai #34
Conversation
crates/bpe/src/byte_pair_encoding.rs
Outdated
```rust
for _ in 0..8 {
    // pick a random token and provisionally add it
    let i = thread_rng().gen_range(0..bpe.num_tokens());
    bytes.extend(bpe.token_bytes(i as u32));
    // test if the additional bytes are valid utf-8
    // the last character is not included, because it may be incomplete
    let last = bytes
        .iter()
        .rev()
        .find_position(|b| is_char_boundary(**b))
        .map_or(0, |(offset, _)| bytes.len() - (offset + 1));
    assert!(last >= valid_bytes);
    if std::str::from_utf8(&bytes[valid_bytes..last]).is_ok() {
        tokens.push(i);
        valid_bytes = last;
        continue 'keep;
    } else {
        bytes.truncate(bytes.len() - bpe.token_len(i as u32));
    }
}
```
First time actually looking at this code :)
I'm not so sure that this procedure has any reasonable chance of constructing valid UTF-8 from "broken" tokens.
There are VERY few of those tokens, so the chance of picking a correct one is extremely low.
The original idea behind picking tokens was that it makes longer, interesting token sequences more likely.
But that doesn't work so easily with broken UTF-8 sequences: you would have to store valid token pairs, and there are far too many pairs for that to be computationally reasonable.
I see. I did a quick test: for both cl100k and o200k, around 99% of the tokens are valid UTF-8 (o200k has more tokens containing multi-byte characters). For cl100k, almost 95% of the tokens match the old condition `token.iter().all(|b| is_char_boundary(*b))`, but only 64% of o200k tokens do.
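For reference, a minimal sketch of such a quick test, assuming the `num_tokens`/`token_bytes` accessors from the diff above; the `token_stats` name and the `is_char_boundary` definition (mirroring std's UTF-8 boundary check) are illustrative assumptions, not code from this PR:

```rust
use bpe::byte_pair_encoding::BytePairEncoding;

// true for any byte that is not a UTF-8 continuation byte (0b10xxxxxx)
fn is_char_boundary(b: u8) -> bool {
    b as i8 >= -0x40
}

// returns (fraction of tokens that are valid UTF-8,
//          fraction matching the old per-byte condition)
fn token_stats(bpe: &BytePairEncoding) -> (f64, f64) {
    let n = bpe.num_tokens();
    let mut valid_utf8 = 0usize;
    let mut all_boundaries = 0usize;
    for i in 0..n {
        let token = bpe.token_bytes(i as u32);
        if std::str::from_utf8(token).is_ok() {
            valid_utf8 += 1;
        }
        if token.iter().all(|b| is_char_boundary(*b)) {
            all_boundaries += 1;
        }
    }
    (valid_utf8 as f64 / n as f64, all_boundaries as f64 / n as f64)
}
```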
I guess I can simplify this function a bit by only testing for valid UTF-8 tokens.
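A minimal sketch of that simplification, under the same assumptions about the accessors; the real function in the PR may differ in signature and sampling strategy:

```rust
use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

pub fn create_test_string(bpe: &BytePairEncoding, tokens: usize) -> String {
    let mut text = String::new();
    for _ in 0..tokens {
        // retry until we pick a token that is valid UTF-8 on its own;
        // with ~99% of tokens valid, this terminates almost immediately
        loop {
            let i = thread_rng().gen_range(0..bpe.num_tokens());
            if let Ok(token) = std::str::from_utf8(bpe.token_bytes(i as u32)) {
                text.push_str(token);
                break;
            }
        }
    }
    text
}
```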
```rust
use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

pub fn create_test_bytes(bpe: &BytePairEncoding, tokens: usize) -> Vec<u8> {
```
I "think" this function was testing more corner cases than the create_test_string variation 🤷
Probably true, yes. I'll see where I can reinstate that.
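For context, a plausible sketch of `create_test_bytes`, reconstructed only from the signature in the diff above, not from the actual implementation:

```rust
use bpe::byte_pair_encoding::BytePairEncoding;
use rand::{thread_rng, Rng};

pub fn create_test_bytes(bpe: &BytePairEncoding, tokens: usize) -> Vec<u8> {
    let mut bytes = Vec::new();
    for _ in 0..tokens {
        // no UTF-8 filtering here, so token boundaries can fall inside
        // multi-byte sequences and produce non-UTF-8 inputs (the extra
        // corner cases mentioned above)
        let i = thread_rng().gen_range(0..bpe.num_tokens());
        bytes.extend(bpe.token_bytes(i as u32));
    }
    bytes
}
```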
Replace look-ahead with multiple patterns ==> 3x speedup
Co-authored-by: Alexander Neubeck <[email protected]>
Changes:
- Moved the `create_test_string` function to `bpe` so it's easier to reuse, and updated it so it can now also generate strings with multi-byte characters.
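A hypothetical round-trip usage of the moved helper; the export path and the `encode_via_backtracking`/`decode_tokens` calls are assumptions based on the `bpe` crate's public API, not on this PR:

```rust
use bpe::byte_pair_encoding::{create_test_string, BytePairEncoding};

fn roundtrip_test(bpe: &BytePairEncoding) {
    // generate a random test string, which may now contain multi-byte characters
    let text = create_test_string(bpe, 1000);
    let tokens = bpe.encode_via_backtracking(text.as_bytes());
    // decode_tokens is assumed here to return Option<Vec<u8>>
    let decoded = bpe.decode_tokens(&tokens).expect("tokens decode back to bytes");
    assert_eq!(decoded, text.into_bytes());
}
```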