Scientists are increasingly asking artificial intelligence to help search and interpret huge collections of data, and it’s making a difference.

But unfortunately, polymer chemistry — the study of large, complex molecules — has been hampered in this effort because it lacks a crisp, coherent language to describe molecules that are not tidy and orderly.

Think nylon. Teflon. Silicone. Polyester. These and other polymers are what chemists call “stochastic”: they’re assembled from predictable building blocks and follow a finite set of attachment rules, but they can differ greatly in the details from one strand to the next, even within the same polymer formulation.

Plastics, love ’em or hate ’em, they’re here to stay.
Photo: Mathias Cramer/temporealfoto.com

Chemistry’s old ball-and-stick models and shorthand chemical notations aren’t adequate for a long molecule that is best described as a series of probabilities that one kind of piece might, or might not, be in a given spot.
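To make the “series of probabilities” idea concrete, here is a minimal Python sketch. It is an illustration only, not anything from the MONET work: the building blocks `A` and `B`, the 70/30 odds, the chain length and the helper name `grow_chain` are all made-up assumptions.

```python
import random


def grow_chain(blocks, weights, length, rng):
    """Pick one building block per position according to the given odds."""
    return "".join(rng.choices(blocks, weights=weights, k=length))


# Hypothetical formulation: two building blocks with assumed 70/30 odds.
blocks = ["A", "B"]
weights = [0.7, 0.3]

rng = random.Random(0)  # seeded only so the run is repeatable
strands = [grow_chain(blocks, weights, 20, rng) for _ in range(3)]

for strand in strands:
    print(strand)
```

Each printed strand comes from the same recipe, yet the exact sequence of pieces generally differs from one strand to the next, which is the property a single fixed structure diagram cannot capture.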

Polymer chemists searching for new materials for medical treatments, or for plastics that won’t become an environmental burden, have been somewhat hampered by a written language that looks like long strings of consonants, equal signs, brackets, carets and parentheses. It’s also somewhat ambiguous, so the same polymer, Nylon-6-6, can end up written like this:

{<C(=O)CCCCC(=O)<,>NCCCCCCN>}

Or this:

{<C(=O)CCCCC(=O)NCCCCCCN>}
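For readers who want to see why those two strings can describe the same material, here is a rough Python sketch. It assumes our reading of the notation: curly braces wrap one stochastic object, commas separate its repeat units, and `<` and `>` mark attachment points, with a `<` end bonding only to a `>` end. The helper names are hypothetical, and this is nowhere near a full parser for the language.

```python
def split_repeat_units(text):
    """Split one stochastic object '{...}' into its comma-separated repeat units."""
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("expected a single stochastic object wrapped in { }")
    return text[1:-1].split(",")


def end_descriptors(unit):
    """Return the attachment symbols found at the two ends of a repeat unit."""
    return unit[0], unit[-1]


two_units = "{<C(=O)CCCCC(=O)<,>NCCCCCCN>}"
one_unit = "{<C(=O)CCCCC(=O)NCCCCCCN>}"

for notation in (two_units, one_unit):
    for unit in split_repeat_units(notation):
        print(notation, unit, end_descriptors(unit))
```

Under that reading, the first string lists two pieces, one with `<` on both ends and one with `>` on both ends, so they can only alternate. The second string fuses the pair into a single piece with one `<` end and one `>` end. Either way the chain comes out the same.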

And when we get to something called ‘concatenation syntax,’ matters only get worse.  

Stephen Craig, the William T. Miller Professor of Chemistry, has been a polymer chemist for almost two decades and he says the notation language above has some utility for polymers. But Craig, who now heads the National Science Foundation’s Center for the Chemistry of Molecularly Optimized Networks (MONET), and his MONET colleagues thought they could do better.


“Once you have that insight about how a polymer is grown, you need to define some symbols that say there’s a probability of this kind of structure occurring here, or some other structure occurring at that spot,” Craig says. “And then it’s reducing that to practice and sort of defining a set of symbols.”

Now he and his MONET colleagues at MIT and Northwestern University have done just that, resulting in a new language, BigSMILES, that’s an adaptation of the existing language called SMILES (simplified molecular-input line-entry system). They think it can reduce the hugely combinatorial problem of describing polymers to something even a dumb computer can understand.

And that, Craig says, should enable computers to do all the stuff they’re good at – searching huge datasets for patterns and finding needles in haystacks.

The initial heavy lifting was done by MONET members Prof. Brad Olsen and his co-worker Tzyy-Shyang Lin at MIT who conceived of the idea and developed the set of symbols and the syntax together. Now polymers and their constituent building blocks and variety of linkages might be described like this:

Examples of BigSMILES symbols from the recent paper

It’s certainly not the best reading material for us and it would be terribly difficult to read aloud, but it becomes child’s play for a computer.

Members of MONET spent a couple of weeks trying to stump the new language with the weirdest polymers they could imagine, which turned up the need for a few more parts to the ‘alphabet.’ But by and large, it holds up, Craig says. They also threw a huge database of polymers at it and it translated them with ease.

“One of the things I’m excited about is how the data entry might eventually be tied directly to the synthetic methods used to make a particular polymer,” Craig says. “There’s an opportunity to actually capture and process more information about the molecules than is typically available from standard characterizations. If that can be done, it will enable all sorts of discoveries.”

BigSMILES was introduced to the polymer community by an article in ACS Central Science last week, and the MONET team is eager to see the response.

“Can other people use it and does it work for everything?” Craig asks. “Because polymer structure space is effectively infinite.” Which is just the kind of thing you need Big Data and machine learning to address. “This is an area where the intersection of chemistry and data science can have a huge impact,” Craig says.