Cleaning US trademark data to analyse trends in ownership

This topic has 11 replies, 4 voices, and was last updated November 9, 2021 at 8:30 pm by jmc.

November 2, 2021 at 12:39 pm #247075
jmc
A modern corporation is likely to possess at least one trademark, as a trademark is the type of intellectual property one uses to register the ownership of images, designs or symbols. These images, designs or symbols, like the Nike swoosh or the BMW quartered circle, are central to protecting a company’s brand identity. Culturally, when we think of a brand identity, we tend not to think of intellectual property law; only the most detail-oriented lawyer would know the serial numbers of Apple’s trademarks from memory. Yet ownership of a trademark is always below the surface of our cultural relationships with brands. When needed, trademark ownership enables a company like Apple to act against any party it believes is creating a design that looks a lot like the silhouette of an apple with a bite taken out of it, or against anyone who reproduces Apple’s logos without authorization. Ownership also forces others to pay for the legal use of someone else’s trademark, except where “fair use” applies.

If a modern corporation builds its brand with trademarks, how many trademarks does a company typically own? Some corporations have brands that are culturally influential and financially successful. Do these corporations own more trademarks than others? For instance, how many trademarks do companies like Nike or Disney own?

Questions like these can be answered with statistical software, but doing so requires that the data are first cleaned. A dataset of trademarks is essentially a number of case files stacked on top of each other. The information of one trademark occupies one row in the dataset. The information in that row can be accurate, but it could also contain typos or misspellings. While such errors are at most minor annoyances to a human reader, they severely hamper the production of quality aggregate data with software. Basic grouping functions in software such as R and Python will only match name strings that are exactly the same.

This document describes the steps I took to clean the ownership data of the United States Patent and Trademark Office (USPTO). The goal of this cleaning was to increase the likelihood that when trademarks were counted per owner (with Python, R, etc.), owners were not separated according to small differences in strings, such as case or typos in names.
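
To illustrate why exact string matching splits owners, here is a minimal sketch that uses only the Python standard library. The owner strings and the `normalize_owner` helper are hypothetical and are not taken from the USPTO files or from the cleaning steps described in the document; the helper only handles case, punctuation and spacing, not typos.

```python
import re
from collections import Counter


def normalize_owner(name: str) -> str:
    """Replace punctuation with spaces, collapse whitespace, and upper-case."""
    no_punct = re.sub(r"[^\w\s]", " ", name)
    return " ".join(no_punct.split()).upper()


# Hypothetical owner strings, one per trademark row.
owners = ["Mattel, Inc.", "MATTEL INC", "mattel inc.", "Nike, Inc.", "NIKE INC"]

# Exact matching treats every spelling as a separate owner.
raw_counts = Counter(owners)

# After normalization, the spellings collapse into one key per owner.
clean_counts = Counter(normalize_owner(o) for o in owners)

print(len(raw_counts), "distinct owners before cleaning")
print(dict(clean_counts))
```

Counting on the raw strings yields five owners with one trademark each, while counting on the normalized strings yields two owners.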

Once cleaned, the ownership data can support critical research in intellectual property. For now, version 1 of the cleaned and grouped USPTO ownership data is available here. Questions or suggestions are welcome, as the goal is to develop a dataset that helps researchers find trends or other interesting patterns in trademark data.
- This topic was modified 3 years, 8 months ago by jmc.
- November 2, 2021 at 1:50 pm #247077
  Scot Griffin
  Do you distinguish between live and dead marks? Also, do the data allow you to distinguish between marks that are actually registered versus those that are only applications?
  
  A quick look at Mattel marks on TESS implies to me that some of the hits from my search are for intent to use applications that are not yet registered marks (and may never be).
  
  Also, you might want to reach out to Thomas McCarthy, a former law professor at University of San Francisco who wrote the most widely known US TM law treatise. It looks like he is still around, and he might be interested in a clean and robust TM dataset (or even have access to one already). Links below:
  
  https://www.mofo.com/people/j-mccarthy.html
  
  https://www.mccarthyinstitute.com/
  
  P.S. Different industry sectors have different approaches to trademarks. Some are very aggressive in applying for marks, and others don’t seem to care much beyond their corporate name. If you were able to add S&P GICS sector tags to the data (e.g., consumer discretionary for Mattel and IT for Apple), that would make the dataset more useful for intra-sector and cross-sector analysis.
  - November 2, 2021 at 2:17 pm #247080
    jmc
    Scot Griffin wrote:
    
    P.S. Different industry sectors have different approaches to trademarks. Some are very aggressive in applying for marks, and others don’t seem to care much beyond their corporate name. If you were able to add S&P GICS sector tags to the data (e.g., consumer discretionary for Mattel and IT for Apple), that would make the dataset more useful for intra-sector and cross-sector analysis.
    
    100%. In another post about Moure’s paper, I found a Compustat application that counts patents per firm in the Compustat database. Funnily enough, they use fuzzy search to match the names — which is what I am using here. I have all the Compustat data, so I can try merging the two and see what happens.
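
    For readers unfamiliar with fuzzy matching on names, a minimal sketch follows. It is not the code used for the dataset: it assumes Python's standard-library `difflib` as the similarity measure, and the names, the `group_similar` helper and the 0.9 threshold are all hypothetical.

    ```python
    from difflib import SequenceMatcher


    def similarity(a: str, b: str) -> float:
        """Return a similarity ratio between 0 and 1 for two strings."""
        return SequenceMatcher(None, a, b).ratio()


    def group_similar(names, threshold=0.9):
        """Greedily assign each name to the first group whose representative
        is at least `threshold` similar; otherwise start a new group."""
        groups = {}
        for name in names:
            for representative in groups:
                if similarity(name, representative) >= threshold:
                    groups[representative].append(name)
                    break
            else:
                groups[name] = [name]
        return groups


    # Hypothetical owner names with small typos.
    names = ["MATTEL INC", "MATTELL INC", "MATEL INC", "NIKE INC"]
    groups = group_similar(names)
    for representative, members in groups.items():
        print(representative, "<-", members)
    ```

    The threshold is a judgment call: set it too low and distinct owners merge, set it too high and typos stay separate.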
    
    This reply was modified 3 years, 8 months ago by jmc.
- November 2, 2021 at 2:13 pm #247078
  jmc
  Scot Griffin wrote:
  
  Do you distinguish between live and dead marks? Also, do the data allow you to distinguish between marks that are actually registered versus those that are only applications? A quick look at Mattel marks on TESS implies to me that some of the hits from my search are for intent to use applications that are not yet registered marks (and may never be).
  
  I did not distinguish because filtering by live, dead, registered or applied is a step one would take later. The document and data are meant to prepare the dataset for all sorts of questions around ownership. The USPTO data schema helps explain the issue. The owner data is separated from the case file data, and TESS merges the data when a query is made.
  - This reply was modified 3 years, 8 months ago by jmc.
- November 2, 2021 at 2:15 pm #247079
  jmc
  Scot Griffin wrote:
  
  Also, you might want to reach out to Thomas McCarthy, a former law professor at University of San Francisco who wrote the most widely known US TM law treatise. It looks like he is still around, and he might be interested in a clean and robust TM dataset (or even have access to one already). Links below: https://www.mofo.com/people/j-mccarthy.html https://www.mccarthyinstitute.com/
  
  Thanks, I’ll look into it and send an FYI. I have some research questions of my own, but I am sharing my prep work for anyone that would want the results of cleaned names.
  - November 3, 2021 at 5:13 pm #247088
    Scot Griffin
    jmc wrote:
    
    Scot Griffin wrote:
    
    Also, you might want to reach out to Thomas McCarthy, a former law professor at University of San Francisco who wrote the most widely known US TM law treatise. It looks like he is still around, and he might be interested in a clean and robust TM dataset (or even have access to one already). Links below: https://www.mofo.com/people/j-mccarthy.html https://www.mccarthyinstitute.com/
    
    Thanks, I’ll look into it and send an FYI. I have some research questions of my own, but I am sharing my prep work for anyone that would want the results of cleaned names.
    
    I may be able to help you with developing (or even answering some of) your research questions. I have thirty years of experience in intellectual property law, mostly focused on patent licensing and IP strategy, generally, but with some trademark practice, and I am able to see things in the data that lay people are unlikely to see.
    
    –Scot
- November 2, 2021 at 4:45 pm #247084
  CM
  This is really awesome James, thanks for sharing! In glancing through it, it is funny/interesting to see some firms so high up on the list (like WWE??).
  
  It also seems there are separate entries for Apple inc and Apple Computer inc., leading to a much lower rank for that company.
  
  I don’t know if you have read K Birch, D Cochrane, and C Ward’s article “Data as Asset?”, but I was fascinated by their finding that the big tech firms seem to hold significantly lower than average intangible assets, and at least for Google and Facebook, slightly higher than average tangible assets. Just goes to show the diversity in how intellectual property/intangible assets are capitalized/capitalized upon, and how little we actually know about the so-called new knowledge economy.
  
  Article:
  
  Birch K, Cochrane D, Ward C. Data as asset? The measurement, governance, and valuation of digital personal data by Big Tech. Big Data & Society. January 2021. doi:10.1177/20539517211017308
  - November 2, 2021 at 7:28 pm #247085
    jmc
    CM wrote:
    
    This is really awesome James, thanks for sharing! In glancing through it, it is funny/interesting to see some firms so high up on the list (like WWE??). It also seems there are separate entries for Apple inc and Apple Computer inc., leading to a much lower rank for that company. I don’t know if you have read K Birch, D Cochrane, and C Ward’s article “Data as Asset?”, but I was fascinated by their finding that the big tech firms seem to hold significantly lower than average intangible assets, and at least for Google and Facebook, slightly higher than average tangible assets. Just goes to show the diversity in how intellectual property/intangible assets are capitalized/capitalized upon, and how little we actually know about the so-called new knowledge economy. Article: Birch K, Cochrane D, Ward C. Data as asset? The measurement, governance, and valuation of digital personal data by Big Tech. Big Data & Society. January 2021. doi:10.1177/20539517211017308
    
    Thanks, Chris. The fine tuning of matching is something I will try to implement if/when I build another version. IMO, there is likely always going to be some manual grouping: e.g., counting Alphabet and Google as one.
    
    The total counts are skewed by the number of years companies are alive. Tesla Motors has 55 trademarks and Ford Motor Company has thousands. If we look at trademarks per year (registered minus cancelled), they might be closer in counts.
- November 3, 2021 at 11:32 am #247086
  Blair Fix
  Really interesting analysis, James. A few questions.
  
  1. Are you chunking the data so you can run the analysis in parallel? Or to reduce the memory load on your computer?
  
  2. I wonder what you’d find if you calculate the ratio of trademarks to dollars of profit (or some other measure of income). It would be interesting to see how that changes the results.
  - November 3, 2021 at 11:49 am #247087
    jmc
    1. Memory load. Everything is re-merged and the fuzzy search is done one more time.
    
    2. If you look at the names, some sectors def. rely on the number of trademarks they register. After seeing the results, it makes sense that MATTEL is the biggest. Hollywood is all over the top ranks. WWE is the largest wrestling league in the world. IGT is the biggest gambling, slot-machine company. ARISTOCRAT TECHNOLOGIES AUSTRALIA PTY LTD. is the second biggest.
- November 9, 2021 at 5:54 pm #247151
  Blair Fix
  On a technical note, when I’m dealing with a dataset that’s too big for my computer’s memory, I use the Unix split command, which divides the file into more manageable chunks: https://kb.iu.edu/d/afar
  - November 9, 2021 at 8:30 pm #247153
    jmc
    Cool, I’ll look into it. For now, chunking via pandas is working.
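
    The chunk-then-merge idea can be sketched with the standard library alone. This is a stand-in, not the pandas code referred to above; the column names, the sample rows and the `iter_chunks` helper are hypothetical.

    ```python
    import csv
    import io
    from collections import Counter
    from itertools import islice


    def iter_chunks(reader, chunk_size):
        """Yield lists of at most `chunk_size` rows from a csv reader."""
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


    # Hypothetical stand-in for a large owner file; a real run would open a file on disk.
    sample = io.StringIO(
        "serial_no,owner\n"
        "1,MATTEL INC\n"
        "2,NIKE INC\n"
        "3,MATTEL INC\n"
        "4,MATTEL INC\n"
        "5,NIKE INC\n"
    )

    reader = csv.DictReader(sample)
    totals = Counter()
    chunk_sizes = []
    for chunk in iter_chunks(reader, chunk_size=2):
        # Only one chunk is held in memory at a time; per-chunk counts are merged.
        chunk_sizes.append(len(chunk))
        totals.update(row["owner"] for row in chunk)

    print(chunk_sizes, dict(totals))
    ```

    In pandas, passing `chunksize` to `read_csv` returns a similar iterator of DataFrames.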