In combination with that Verge piece about the death of robots.txt, this story suggests the ol' tragedy of the commons is reaching its middle.
oh wow, I had missed that article! Thanks, Mikey. I just read it--and yes, definitely, I see the parallels. These systems that have relied on good will and open sharing--the web and universities--are getting taken advantage of in this mad rush for AI training data.
"For many publishers and platforms, having their data crawled for training data felt less like trading and more like stealing."
https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders
Yikes. That's spooky stuff. Thank you for writing about it!
yeah, it's a little grim. :/
Indeed...
Wow! I was shocked when I first read this, but after having a little time to sit with it, I guess it's not surprising if we consider that these tech companies are already, and will continue to be, searching for any and all high-quality human writing they can find to feed into their models. It seems like they can't eat their own dog food anyway.
Your extension of the oil metaphor to questions of scarcity also seems to undermine the narrative that LLMs will eventually be handling most (or all?) of our literate activity on our behalf. Their continued effectiveness would seem to rely on the continued availability of human writing to fuel them.
Feeling down about the future of open access data but weirdly a little more optimistic about the value of human writing. Great read!
yeah, I do feel a bit depressed about my own conclusion here--closing open access data.
I think you're totally right to point out the irony that the AI relies on human writing to continue. Unless synthetic data gets good enough, I guess?
Thanks for reading, Michael!
Man, this cuts to the core of faculty anxiety. I raised the issue when they upgraded Blackboard Ultra with a built-in AI assistant. There’s no mention of what it pulls from your course when a user hits auto-generate. My guess is most of the written content is used. I have no idea what happens to the data.
The open movement in higher ed is strong. I’d hate to see data scraping destroy OER and other open practices. Great post!
oh wow, yes--I didn't even think about the course management systems and how they might be harvesting data from students. I would also guess that if they have the data, they're using it--unless there's an explicit statement in the T&C that they're not.
I'm also concerned about what this means for the culture of universities and open education resources. :/ Thanks for reading, Marc!