Compiler

Deterministic Finite Automata

A Deterministic Finite Automata simulator.

Note

This file is part of Empirical, https://github.com/devosoft/Empirical

Copyright

Copyright (C) Michigan State University, MIT Software license; see doc/LICENSE.md

Date

2016-2021.

Note

Status: BETA

namespace emp

If we are in emscripten, make sure to include the header.

Typedefs

using DFA = tDFA<128, uint8_t>

Setup DFA to be a simple tDFA with the basic character set for symbols.

template<int NUM_SYMBOLS = 128, typename STOP_TYPE = uint8_t>
class tDFA

Public Types

using stop_t = STOP_TYPE

Public Functions

tDFA(size_t num_states = 0)
tDFA(const tDFA<NUM_SYMBOLS, STOP_TYPE>&) = default
~tDFA()
tDFA<NUM_SYMBOLS, STOP_TYPE> &operator=(const tDFA<NUM_SYMBOLS, STOP_TYPE>&) = default
size_t GetSize() const

How many states is this DFA using?

void Resize(size_t new_size)

Add Additional empty states.

const emp::array<int, NUM_SYMBOLS> &GetTransitions(size_t from) const

Return an array of all transitions associated with a specified state.

void SetTransition(size_t from, size_t to, size_t sym)

Add a specific transition associated with an input symbol.

void SetStop(size_t state, stop_t stop_val = 1)

Set the stop value (no matter what it currently is)

void AddStop(size_t state, stop_t stop_val = 1)

Set the stop value only if it’s higher than the current stop value.

stop_t GetStop(int state) const

Get the stop value associated with a state.

bool IsActive(int state) const

Test if a state is still valid.

bool IsStop(int state) const

Test if a state has a stop.

stop_t GetStop(size_t state) const
bool IsActive(size_t) const
bool IsStop(size_t state) const
int Next(int state, size_t sym) const

Return the new state after a symbol occurs.

int Next(int state, std::string sym_set) const

Return the new state after a series of symbols.

stop_t Test(const std::string &str) const

Determine if an entire series of symbols is valid.

void Print(std::ostream &os = std::cout)

Print details about this DFA.

Private Members

emp::vector<emp::array<int, NUM_SYMBOLS>> transitions
emp::vector<STOP_TYPE> is_stop

Lexer Utilities

A set of utilities to convert between NFAs and DFAs.

Note

This file is part of Empirical, https://github.com/devosoft/Empirical

Copyright

Copyright (C) Michigan State University, MIT Software license; see doc/LICENSE.md

Date

2016-2017

Note

Status: BETA

namespace emp

If we are in emscripten, make sure to include the header.

Functions

const DFA &to_DFA(const DFA &dfa)

Converting DFA to DFA no change needed.

const NFA &to_NFA(const NFA &nfa)

Converting NFA to MFA no change needed.

DFA to_DFA(const NFA &nfa, int keep_invalid = false)

Systematic conversion of NFA to DFA…

NFA to_NFA(const DFA &dfa)

Systematic up-conversion of DFA to NFA…

template<typename T1>
NFA MergeNFA(T1 &&in)

Merge multiple automata into one NFA (base case, single converstion)

template<typename T1, typename T2, typename ...Ts>
NFA MergeNFA(T1 &&in1, T2 &&in2, Ts&&... others)

Merge multiple automata (DFA, NFA, RegEx) into one NFA.

template<typename T1, typename T2, typename ...Ts>
DFA MergeDFA(T1 &&in1, T2 &&in2, Ts&&... others)

Merge multiple automata (DFA, NFA, RegEx) into one DFA.

std::string FindExample(const DFA &dfa, const size_t min_size = 1)

Method to find an example string that satisfies a DFA.

struct DFAStatus
#include <lexer_utils.hpp>

Structure to track the current status of a DFA.

Public Functions

DFAStatus(size_t _state, const std::string &_seq)

Public Members

size_t state
std::string sequence

Lexer

A general-purpose, fast lexer.

Note

This file is part of Empirical, https://github.com/devosoft/Empirical

Copyright

Copyright (C) Michigan State University, MIT Software license; see doc/LICENSE.md

Date

2016-2019.

Note

Status: ALPHA

namespace emp

If we are in emscripten, make sure to include the header.

class Lexer
#include <Lexer.hpp>

A lexer with a set of token types (and associated regular expressions)

Public Functions

Lexer()
~Lexer()
size_t GetNumTokens() const

How many types of tokens can be identified in this Lexer?

bool TokenOK(int id) const
int AddToken(const std::string &name, const std::string &regex, bool save_lexeme = true, bool save_token = true, const std::string &desc = "")

Add a new token, specified by a name and the regex used to identify it. Note that token ids count down with highest IDs having priority.

int IgnoreToken(const std::string &name, const std::string &regex, const std::string &desc = "")

Add a token to ignore, specified by a name and the regex used to identify it.

int GetTokenID(const std::string &name) const

Get the ID associated with a token type (you provide the token name)

const TokenInfo &GetTokenInfo(int id) const

Get the full information about a token (you provide the id)

std::string GetTokenName(int id) const

Get the name associated with a token type (you provide the ID)

bool GetSaveToken(int id) const

Identify if a token should be saved.

void Generate() const

Create the NFA that will identify the current set of tokens in a sequence.

Token Process(std::istream &is)

Get the next token found in an input stream. Do so by examining one character at a time. Keep going as long as there is a chance of a valid lexeme (since we want to choose the longest one we can find.) Every time we do hit a valid lexeme, store it as the current “best” and keep going. Once we hit a point where no other valid lexemes are possible, stop and return the best we’ve found so far.

Token Process(std::string &in_str)

Shortcut to process a string rather than a stream.

emp::vector<Token> Tokenize(std::istream &is)

Turn an input stream of text into a vector of tokens.

emp::vector<Token> Tokenize(const std::string &str)

Turn an input string into a vector of tokens.

emp::vector<Token> Tokenize(const emp::vector<std::string> &str_v)

Turn a vector of strings into a vector of tokens.

const std::string &GetLexeme()

Get the lexeme associated with the last token identified.

void Print(std::ostream &os = std::cout) const

Print the full information about this lexer (for debugging)

void DebugString(std::string test_string)

Try out the lexer on a string and demonstrate how it’s tokenized.

void DebugToken(int token_id) const

Private Members

emp::vector<TokenInfo> token_set

List of all active tokens.

emp::map<std::string, int> token_map

Map of token names to id.

int cur_token_id = MAX_ID

Which ID should the next new token get?

bool generate_lexer = false

Do we need to regenerate the lexer?

DFA lexer_dfa

Table driven lexer implementation.

std::string lexeme

Current state of lexeme being generated.

const TokenInfo ERROR_TOKEN = {"", "", ERROR_ID, true, true, "Unable to parse input!"}

Private Static Attributes

constexpr int MAX_ID = 255

IDs count down so that first ones have priority.

constexpr int ERROR_ID = -1

Code for unknown token ID.

struct Token
#include <Lexer.hpp>

Information about a token instance from an input stream.

Public Functions

Token(int id, const std::string &str = "", size_t _line = 0)
Token(const Token&) = default
Token(Token&&) = default
Token &operator=(const Token&) = default
Token &operator=(Token&&) = default
operator int() const

Token will automatically convert to its ID if used as an int.

operator const std::string&() const

Token will automatically convert to its matched sequence (lexeme) is used as a string.

Public Members

int token_id

Which type of token is this?

std::string lexeme

Sequence matched by this token (or empty if not saved)

size_t line_id

Which line did this token start on?

struct TokenInfo
#include <Lexer.hpp>

Information about an individual TYPE of token to be processed within a Lexer.

Public Functions

TokenInfo(const std::string &_name, const std::string &_regex, int _id, bool _save_l = true, bool _save_t = true, const std::string &_desc = "")
TokenInfo(const TokenInfo&) = default
TokenInfo(TokenInfo&&) = default
TokenInfo &operator=(const TokenInfo&) = default
TokenInfo &operator=(TokenInfo&&) = default
void Print(std::ostream &os = std::cout) const

Print out the status of this token (for debugging)

Public Members

std::string name

Name of this token type.

std::string desc

More detailed description of this token type.

RegEx regex

Pattern to describe token type.

int id

Unique id for token.

bool save_lexeme

Preserve the lexeme for this token?

bool save_token

Keep token at all? (Whitespace and comments are often discarded).

NonDeterministic Finite Automata

A Non-deterministic Finite Automata simulator.

To build a standard NFA, use

emp::NFA. If you want to have more symbols or more stop states, use emp::tNFA<S,T> where S is the number of symbols and T is the type used for stop. (defaults are 128 for ASCII-128 and uint8_t respectively.)
Note

This file is part of Empirical, https://github.com/devosoft/Empirical

Copyright

Copyright (C) Michigan State University, MIT Software license; see doc/LICENSE.md

Date

2016-2017

Note

Status: BETA

The constructor can take as parameters the number of states and the id of the start state (both default to 0)

Note

DFA’s use SetTransition(), but NFA’s use AddTransition. This distinction is intentional since in a DFA a second SetTransition with the same start state and symbol will override first, while in an NFA a second AddTransition will always add a new option.

namespace emp

If we are in emscripten, make sure to include the header.

Typedefs

using NFA = tNFA<128, uint8_t>

NFA is the most standard tNFA setup.

using NFA_State = tNFA_State<128, uint8_t>

NFA_State is the most standard tNFA_State setup.

template<size_t S = 128, typename STOP_TYPE = uint8_t>
class tNFA
#include <NFA.hpp>

A dynamic NFA class, for easily building non-determanistic finite automata.

Public Types

using opts_t = BitSet<NUM_SYMBOLS>
using stop_t = STOP_TYPE

Public Functions

tNFA(size_t num_states = 1, size_t start_state = 0)
tNFA(const tNFA<S, STOP_TYPE>&) = default
~tNFA()
tNFA<S, STOP_TYPE> &operator=(const tNFA<S, STOP_TYPE>&) = default
size_t GetSize() const

Return the current number of states.

const std::set<size_t> &GetStart() const

Return start state and all others reachable through empty transitions.

std::set<size_t> GetNext(size_t sym, size_t from_id = 0) const

Return the states reachable from the current state given the provided symbol.

std::set<size_t> GetNext(size_t sym, const std::set<size_t> from_set) const

return the states reachable from the current set of states given the provided symbol.

bool HasFreeTransitions(size_t id) const

Does the provided state have free transitions?

bool HasSymTransitions(size_t id) const

Does the provided state have symbol-transitions?

opts_t GetSymbolOptions(const std::set<size_t> &test_set) const

Return an emp::BitSet indicating the symbols available from the provided set of states.

void Resize(size_t new_size)

Change the number of available states.

size_t AddNewState()

Add a new state into the NFA and return its id.

void AddTransitionSymbol(size_t from, size_t to, size_t sym)

Add a transition between states ‘from’ and ‘to’ that can be taken with the provided symbol.

void AddTransition(size_t from, size_t to, const char sym)

Add a transition between states ‘from’ and ‘to’ that can be taken with a char symbol.

void AddTransition(size_t from, size_t to, const std::string &sym_set)

Add a transition between states ‘from’ and ‘to’ that can be taken with the provided symbols.

void AddTransition(size_t from, size_t to, const char *sym_set)

Add a transition between states ‘from’ and ‘to’ that can be taken with the provided symbols.

void AddTransition(size_t from, size_t to, const BitSet<NUM_SYMBOLS> &sym_set)

Add a transition between states ‘from’ and ‘to’ that can be taken with the provided symbols.

void AddFreeTransition(size_t from, size_t to)

Create a free transition between ‘from’ and ‘to’.

template<typename T = stop_t>
void SetStop(size_t state, T stop_val = 1)

Set the specified state to be a stop state (with an optional stop value.)

stop_t GetStop(size_t state) const

Get any stop value associated with the provided state.

bool IsStart(size_t state) const

Test if NFA begins at provided state (may have free transitions to other states)

bool IsStop(size_t state) const

Test if this state is a legal endpoint for the NFA.

bool IsEmpty(size_t state) const

Test if this state has only empty transitions from it, and not stop state.

void Merge(const tNFA<NUM_SYMBOLS, STOP_TYPE> &nfa2)

Merge another NFA into this one.

void Print() const

Print information about this NFA (for debugging)

void PrintFreeMoves()

Identify free moves in NFA (for debugging)

Public Static Attributes

constexpr const size_t NUM_SYMBOLS = S

Private Members

emp::vector<State> states

Information about available states.

size_t start

Main start state (others might be reached for free.)

emp::vector<STOP_TYPE> is_stop

0=no 1=yes (char instead of bool for speed)

struct State

Public Functions

State()

Public Members

std::map<size_t, Transition> trans

What symbol transitions are available?

std::set<size_t> free_to

What other states can you move to for free?

std::set<size_t> free_from

What other states can move here for free?

struct Transition

Public Functions

Transition()

Public Members

opts_t symbols
template<size_t NUM_SYMBOLS = 128, typename STOP_TYPE = uint8_t>
class tNFA_State
#include <NFA.hpp>

Information about the current full state (i.e., set of legal states) of an NFA.

Public Functions

tNFA_State(const tNFA<NUM_SYMBOLS, STOP_TYPE> &_nfa)
~tNFA_State()
const tNFA<NUM_SYMBOLS, STOP_TYPE> &GetNFA() const

Get the NFA associated with this state.

const std::set<size_t> &GetStateSet() const

Get a set of states that are currently active.

bool IsActive() const

Are there currently any legal NFA states?

bool IsStop() const

Can we legally stop in any of the current states?

bool HasState(size_t id)

Is a particular NFA state currently included?

size_t GetSize()

How many states are currently included?

void SetStateSet(const std::set<size_t> &in)

Set the current states.

void Reset()

Change current states to start + free transitions from start.

void Next(size_t sym)

Update states given a new input symbol.

void Next(const std::string &sym_set)

Update states given a new series of input symbols (as a string)

void Print()

Print out current information about this NFA State (for debugging)

Private Members

const tNFA<NUM_SYMBOLS, STOP_TYPE> &nfa

Which NFA is this state set associated with?

std::set<size_t> state_set

Which states are currently legal?

Regular Expressions

Basic regular expression handler.

A fully (well, mostly) functional regular expression processor.

Note

This file is part of Empirical, https://github.com/devosoft/Empirical

Copyright

Copyright (C) Michigan State University, MIT Software license; see doc/LICENSE.md

Date

2016-2019.

Note

Status: BETA

Special chars: ‘|’ - or ‘*’ - zero or more of previous ‘+’ - one or more of previous ‘?’ - previous is optional ‘.’ - Match any character except

Plus the following group contents (and change may translation rules) ‘(’ and ‘)’ - group contents ‘”’ - Ignore special characters in contents (quotes still need to be escaped) ‘[’ and ‘]’ - character set

choose ONE character ‘^’ as first char negates contents ‘-’ indicates range UNLESS first or last.

Additional overloads for functions in lexer_utils.h:

static NFA to_NFA(const RegEx & regex, int stop_id=1); static DFA to_DFA(const RegEx & regex);

namespace emp

If we are in emscripten, make sure to include the header.

Functions

NFA to_NFA(const RegEx &regex, size_t stop_id = 1)

Simple conversion of RegEx to NFA (mostly implemented in RegEx)

DFA to_DFA(const RegEx &regex)

Conversion of RegEx to DFA, via NFA intermediate.

class RegEx
#include <RegEx.hpp>

A basic regular expression handler.

Public Functions

RegEx() = delete
RegEx(const std::string &r)
RegEx(const RegEx &r)
~RegEx()
RegEx &operator=(const RegEx &r)

Set this RegEx equal to another.

std::string AsString() const

Convert the RegEx to an standard string, readable from outsite this class.

void AddToNFA(NFA &nfa, size_t start, size_t stop) const

Add this regex to an NFA being built.

void Generate() const

Assume the RegEx is ready and setup processing for it.

bool Test(const std::string &str) const

Test if a string statisfies this regex.

void PrintInternal() const

For debugging: print the internal representation of the regex.

void PrintNotes() const

For debugging: print any internal notes generated about this regex.

void PrintDebug() const

Print general debuging information about this regex.

Private Types

using opts_t = BitSet<NUM_SYMBOLS>

Private Functions

template<typename ...T>
void Error(T&&... args)
bool EnsureNext(char x)

Make sure that there is another element in the RegEx (e.g., after an ‘|’) or else trigger and error to report the problem.

Ptr<re_charset> ConstructSet()

Construct a character range.

Ptr<re_string> ConstructString()

Construct a string, loading everything needed.

Ptr<re_base> ConstructSegment()

Should only be called when we know we have a single unit to produce. Build and return it.

Ptr<re_block> Process(Ptr<re_block> cur_block = nullptr)

Process the input regex into a tree representaion.

Private Members

std::string regex

Original string to define this RegEx.

emp::vector<std::string> notes

Any warnings or errors would be provided here.

bool valid = true

Set to false if regex cannot be processed.

size_t pos = 0

Position being read in regex.

DFA dfa

DFA that this RegEx translates to.

bool dfa_ready = false

Is the dfa ready? (or does it need to be generated?)

re_block head

Private Static Attributes

constexpr size_t NUM_SYMBOLS = 128

Maximum number of symbol the RegEx can handle.

struct re_base

Internal base representation of a portion of a regex.

Public Functions

~re_base()
void Print(std::ostream &os) const
Ptr<re_block> AsBlock()
Ptr<re_charset> AsCharSet()
Ptr<re_parent> AsParent()
Ptr<re_string> AsString()
size_t GetSize() const
bool Simplify()
void AddToNFA(NFA &nfa, size_t start, size_t stop) const
struct re_block : public emp::RegEx::re_parent

Representation of a series of components…

Public Functions

void Print(std::ostream &os) const override
Ptr<re_block> AsBlock() override
bool Simplify() override
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override
struct re_charset : public emp::RegEx::re_base

Representation of a character set e.g., [abc].

Public Functions

re_charset()
re_charset(char x, bool neg = false)
re_charset(const std::string &s, bool neg = false)
void Print(std::ostream &os) const override
Ptr<re_charset> AsCharSet() override
size_t GetSize() const override
char First() const
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override

Public Members

opts_t char_set
struct re_or : public emp::RegEx::re_parent

Representation of two options in a regex, e.g., a|b.

Public Functions

re_or(Ptr<re_base> l, Ptr<re_base> r)
void Print(std::ostream &os) const override
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override
struct re_parent : public emp::RegEx::re_base

Intermediate base class for RegEx components that have children (such as “and” and “or”)

Public Functions

re_parent()
~re_parent()
void Clear()
void push(Ptr<re_base> x)
Ptr<re_base> pop()
size_t GetSize() const override
Ptr<re_parent> AsParent() override
bool Simplify() override

Protected Attributes

emp::vector<Ptr<re_base>> nodes
struct re_plus : public emp::RegEx::re_parent

Representations of one-or-more instances of a component. e.g., a+.

Public Functions

re_plus(Ptr<re_base> c)
void Print(std::ostream &os) const override
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override
struct re_qm : public emp::RegEx::re_parent

Representations of zero-or-one instances of a component. e.g., a?

Public Functions

re_qm(Ptr<re_base> c)
void Print(std::ostream &os) const override
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override
struct re_star : public emp::RegEx::re_parent

Representations of zero-or-more instances of a component. e.g., a*.

Public Functions

re_star(Ptr<re_base> c)
void Print(std::ostream &os) const override
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override
struct re_string : public emp::RegEx::re_base

Representation of strings stored in a RegEx.

Public Functions

re_string()
re_string(char c)
re_string(const std::string &s)
void Print(std::ostream &os) const override
Ptr<re_string> AsString() override
size_t GetSize() const override
void AddToNFA(NFA &nfa, size_t start, size_t stop) const override

Public Members

std::string str