By Aaron Gokaslan* and Vanya Cohen*
Recently, large language models like BERT¹, XLNet², GPT-2³, and Grover⁴ have demonstrated impressive results in generating text and on multiple NLP tasks. Since OpenAI has not yet released its largest model (though it has released the 774M-parameter version), we seek to replicate the 1.5B model to allow others to build on our pretrained model and further improve it.
You can access the model and generate text using our Google Colab.
We’ve also made the model weights available separately.
Radford et al.’s³ security strategy of delaying the release of the model relies on these models being difficult to replicate and requiring a high degree of specialized domain knowledge. We demonstrate that many of the results of the paper can be replicated by two master’s students with no prior experience in language modeling. Because this model is relatively easy to replicate, a large number of interested parties could replicate GPT-2. Further, Zellers et al.⁴ show that large language models like GPT-2 are an invaluable tool for countering the use of the same models as text generators.
Because our replication efforts are not unique, and large language models are the current most effective means of countering generated text, we believe releasing our model is a reasonable first step towards countering the potential future abuse of these kinds of models.
We base our implementation on the Grover model⁴ and modify their codebase to match the language modeling training objective of GPT-2. Since their model was trained on a similarly large corpus, much of the code and many of the hyperparameters proved readily reusable. We did not substantially change the hyperparameters from Grover.
The cost of training the model from scratch using our code is about $50k. It’s important to note that this figure is the estimated value of the cloud compute and does not reflect the much smaller intrinsic cost: training is cheaper on less time-efficient, less user-friendly compute resources.
There is a significant time-cost tradeoff, and slower training methods have considerably smaller costs, thus reducing the barrier to entry.
The original paper provided minimal details on how the dataset was cleaned.
As in WebText³, we begin by parsing out all links from Reddit submissions with more than 3 up-votes. We started with the Pushshift Reddit scrape⁵, a dataset containing a continuously updated collection of Reddit posts, comments, and related metadata. These links are then filtered to remove direct links to file types unlikely to contain usable text or HTML (e.g. video files, PDFs, and CSS style files).
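The link-filtering step can be sketched as follows. This is a minimal illustration, not our pipeline code: the field names (`score`, `url`) and the extension list are assumptions about the Pushshift record format.

```python
from urllib.parse import urlparse

# Illustrative list of file extensions unlikely to yield usable text or HTML.
BAD_EXTENSIONS = {".mp4", ".avi", ".mov", ".pdf", ".css", ".js",
                  ".png", ".jpg", ".gif", ".zip", ".mp3"}

def keep_link(submission):
    """Keep links from submissions with more than 3 up-votes that do not
    point directly at a filtered file type."""
    if submission.get("score", 0) <= 3:
        return False
    path = urlparse(submission.get("url", "")).path.lower()
    return not any(path.endswith(ext) for ext in BAD_EXTENSIONS)

submissions = [
    {"score": 12, "url": "https://example.com/article.html"},
    {"score": 2,  "url": "https://example.com/popular.html"},  # too few up-votes
    {"score": 50, "url": "https://example.com/clip.mp4"},      # direct video link
]
links = [s["url"] for s in submissions if keep_link(s)]
# links → ["https://example.com/article.html"]
```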
We also filter webpages to remove Wikipedia, as it is used by various evaluation benchmarks and datasets. We were not able to determine whether our filtering criteria matched OpenAI’s, since this information was never released. Text was extracted from HTML pages using the Newspaper Python library, and then filtered for English text using the fastText library⁶. Specifically, we use the WhatTheLang Python wrapper⁷. We deduplicate documents using locality-sensitive hashing (LSH)⁸ ⁹ ¹⁰. We hashed documents into sets of 5-grams, and all documents with a similarity greater than 0.5 were removed.
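The similarity criterion that LSH approximates at scale can be sketched exactly with Jaccard similarity over word 5-gram sets. This hypothetical minimal version is not our pipeline code (which hashes the shingle sets so near-duplicates collide rather than comparing all pairs), but it implements the same ≥ 0.5 removal rule:

```python
def shingles(text, n=5):
    """Split a document into its set of word 5-grams."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(docs, threshold=0.5):
    """Greedy exact-Jaccard dedup: keep a document only if it is not too
    similar to any already-kept document."""
    kept = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, shingles(k)) <= threshold for k in kept):
            kept.append(doc)
    return kept
```

With MinHash-based LSH, each shingle set is summarized by a short signature, so candidate near-duplicates are found without the quadratic pairwise comparison above.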
As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset. These shorter documents tended to be lower quality, as determined by text coherence. We release this dataset as the OpenWebTextCorpus¹¹.
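The length heuristic itself is straightforward. In this sketch, tokens are approximated by whitespace splitting; our pipeline counted tokens produced by the encoder, so exact counts differ:

```python
MIN_TOKENS = 128

def passes_length_filter(doc, min_tokens=MIN_TOKENS):
    # Approximate token count via whitespace splitting; the real pipeline
    # counts encoder tokens instead.
    return len(doc.split()) >= min_tokens

corpus = ["word " * 200, "too short to keep"]
cleaned = [d for d in corpus if passes_length_filter(d)]
# cleaned keeps only the 200-token document
```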
For encoding the dataset, we used the byte-pair encoder (BPE)¹² released with the small models from Radford et al.³
We used a modified version of the OpenWebText web-scraping codebase¹³ as a starting point for our dataset collection.
From the publicly released collection of 260k documents from WebText³, we find that all have a minimum byte-pair encoding (BPE)¹² length of 40 and a maximum of 1024. OpenWebText differs in that we set a lower bound for document length of 128 tokens (instead of BPE codes) and do not restrict the maximum document length. Our corpus was released before these samples became available and therefore did not use this information when designing cleaning heuristics.
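The bounds above are easy to verify once the released samples are tokenized. In this sketch, `encoded_samples` is a hypothetical stand-in for the list of BPE token-id sequences, one per document:

```python
# Hypothetical: encoded_samples holds one list of BPE token ids per document,
# standing in for the tokenized WebText release.
encoded_samples = [
    list(range(40)),    # shortest observed document: 40 BPE codes
    list(range(512)),
    list(range(1024)),  # longest observed document: 1024 BPE codes
]

lengths = [len(doc) for doc in encoded_samples]
shortest, longest = min(lengths), max(lengths)
```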
We made multiple attempts to contact Radford et al.³ to clarify evaluation and model details, but were ultimately unsuccessful.
Despite the differences in our training distribution, we do report similar perplexities over most datasets.
Prompt: “Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!”
Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!! That´s because recycling is all about selling to customers who cannot afford to recycle.” If you don’t have any interest in recycling, in whatever form, then you could well be doing your own global environmental benefit if you just got rid of it. Let me explain. Recycling takes time, money and energy, and it doesn’t really solve the problem. If you go to recycling drop-off shops and they just sort out your recyclables for you then it means there is no need to sell them again to the local recycling company; they just take it back from you and you can just dump it. There is nothing clever about this approach, in fact it just shuffles the problem around instead of solving it. If you put the recycling drop-off shops under your local council then you could actually find more recycling opportunities for local communities, as well as more time for recycling. Recycling does have some short term upsides, but only in cases where you don’t have an immediate need for the kind of plastic you are recycling; then at least the glass, paper, steel and plastic products are free to dispose of. But this is the case only in the short term. In the long term, in the age of climate change, you will need to take more care of what you do with your plastics or at least put it into our recycling system. I’m sure there are lots of companies out there that could invest money and effort in dealing with the issues of waste plastics. Instead of just shredding them, or leaving them in the ground, companies could turn them into good quality recycling products, or turn them into energy. There are other options as well, such as taking the material and using it to make a tree pulp or textiles. However, it doesn’t have to cost anything to divert waste plastics and plastics derived from energy, to something that actually has a long-term role in the world; that´s the whole idea. 
We are wasting tonnes of energy because we don’t recycle our waste plastics, and if we did then we could save that energy for our own needs. This is a far greater percentage of the energy budget of the world than just for producing energy, but it is still a lot. And the fuel used to make the plastics is costing more and more: Another short term gain from recycling comes when you remove disposable plastics from the landfills and recycling centres; those plastic bags, bottles, filters, rolls and clothes go straight into the world’s oceans, which are in big trouble right now, as poor diets contribute to pollution in the sea, sea life are starving and thousands of species are on the brink of extinction. Many companies are developing plastic “gifts” for the children of developing countries, but much more needs to be done. Companies can do a lot when they commit to the recycling of plastic goods. So, at this point, recycling is all about consumer habits. That´s really all it is. If people choose to separate and segregate their waste plastics, then recycling will definitely have more of an impact than if they don’t. Join the world’s largest recycling network: RecycleAmerica , featuring the largest collection of energy recovery and energy bond companies, designers, and manufacturers in the US. Don’t believe me? Find out more. Contribute to the global efforts to make plastics more sustan